Abstract: We present the discrete infinite logistic normal distribution (DILN), a Bayesian
nonparametric prior for mixed membership models. DILN generalizes the hierarchical 
Dirichlet process (HDP) to model correlation structure between the weights of the atoms at 
the group level. We derive a representation of DILN as a normalized collection of gamma- 
distributed random variables and study its statistical properties. We derive a variational 
inference algorithm for approximate posterior inference. We apply DILN to topic modeling 
of documents and study its empirical performance on four corpora, comparing performance 
with the HDP and the correlated topic model (CTM). To compute with large-scale data, 
we also develop a stochastic variational inference algorithm for DILN and compare with 
similar algorithms for HDP and LDA on a collection of 350,000 articles from Nature.

1. Introduction 

The hierarchical Dirichlet process (HDP) has emerged as a powerful Bayesian nonparametric 
prior for grouped data (Teh et al, 2006), particularly in its role in Bayesian nonparametric 
mixed-membership models. In an HDP mixed-membership model, each group of data is modeled 
with a mixture where the mixture proportions are group-specific and the mixture components 
are shared across the data. While finite models require the number of mixture components to be 
fixed in advance, the HDP model allows the data to determine how many components are needed. 
And that number is variable: With an HDP model, new data can induce new components. 

The HDP mixed-membership model has been widely applied to probabilistic topic modeling, 
where hierarchical Bayesian models are used to analyze large corpora of documents in the service 
of exploring, searching, and making predictions about them (Blei and Lafferty, 2007, 2009; Blei, 
Ng and Jordan, 2003; Erosheva, Fienberg and Lafferty, 2004; Griffiths and Steyvers, 2004). In 
topic modeling, documents are grouped data — each document is a group of observed words — and 
we analyze the documents with a mixed-membership model. Conditioned on a collection, the
posterior expectations of the mixture components are called "topics" because they tend to
resemble the themes that pervade the documents; the posterior expectations of the mixture
proportions identify how each document exhibits the topics. Bayesian nonparametric topic
modeling uses an HDP to try to solve the model selection problem; the number of topics is
determined by the data and new documents can exhibit new topics.

For example, consider using a topic model to analyze 10,000 articles from Wikipedia. (This 
is a data set that we will return to.) At the corpus level, the posterior of one component might 
place high probability on terms associated with elections; another might place high probability 
on terms associated with the military. At the document level, articles that discuss both subjects 
will have posterior proportions that place weight on both topics. The posterior of these quantities 
over the whole corpus can be used to organize and summarize Wikipedia in a way that is not 
otherwise readily available. 

Though powerful, the HDP mixed-membership model is limited in that it does not explicitly 
model the correlations between the mixing proportions of any two components. For example, the 
HDP topic model cannot capture that the presence of the election topic in a document is more 
positively correlated with the presence of the military topic than with a topic about mathematics.
Capturing such patterns, i.e., representing that one topic might often co-occur with another, can 
provide richer exploratory variables to summarize the data and further improve prediction. 

To address this, we developed the discrete infinite logistic normal distribution (DILN,
pronounced "Dylan"), a Bayesian nonparametric prior for mixed-membership models (Paisley,
Wang and Blei, 2011).¹ As with the HDP, DILN generates discrete probability distributions
on an infinite set of components, where the same components are shared across groups but have
different probabilities within each group. Unlike the HDP, DILN also models the correlation
structure between the probabilities of the components.

Figure 1 illustrates the DILN posterior for 10,000 articles from Wikipedia. The corpus is 
described by a set of topics — each topic is a distribution over words and is visualized by listing 
the most probable words — and the topics exhibit a correlation structure. For example, topic 
3 ("party, election, vote") is correlated with topic 12 ("constitution, parliament, council") and 
topic 25 ("coup, army, military"). It is negatively correlated with topic 20 ("food, meat, drink"). 

In DILN, each component is associated with a parameter (e.g., a topical distribution over 
terms) and a location in a latent space. For group-level distributions (e.g., document-specific 
distributions over topics), the correlation between component weights is determined by a kernel
function of latent locations of these components. Since the correlation between occurrences is 
a posterior correlation, i.e., one that emerges from the data, the locations of the components 
are also latent. For example, we do not enforce a priori what the topics are and how they are 
correlated — this structure comes from the posterior analysis of the text. 

We formulate two equivalent representations of DILN. We first formulate it as an HDP scaled 
by a Gaussian process (Rasmussen and Williams, 2006). This gives an intuitive picture of how the 
correlation between component weights enters the distribution and makes clear the relationship 
between DILN and the HDP. We then formulate DILN as a member of the normalized gamma 
family of random probability distributions. This lets us characterize the a priori correlation 
structure of the component proportions. 

The central computational problem for DILN is approximate posterior inference. Given a cor- 
pus, we want to compute the posterior distribution of the topics, per-document topic proportions, 
and the latent locations of the topics. Using the normalized gamma construction of a random
measure, we derive a variational inference algorithm (Jordan et al, 1999) to approximate the 
full posterior of a DILN mixed-membership model. (Moreover, this variational algorithm can 
be modified into a new posterior inference algorithm for HDP mixed-membership models.) We
use variational inference to analyze several collections of documents, each on the order of thou- 
sands of articles, determining the number of topics based on the data and identifying an explicit 
correlation structure among the discovered topics. On four corpora (collected from Wikipedia, 
Science, The New York Times, and The Huffington Post), we demonstrate that DILN provides 
a better predictive model and an effective new method for summarizing and exploring text data. 
(Again, see Figure 1 and also Figures 4, 5 and 6.) 

Variational inference turns the problem of approximating the posterior into an optimization 
problem. Recent research has used stochastic optimization to scale variational inference up to 
very large data sets (Armagan and Dunson, 2011; Hoffman, Blei and Bach, 2010), including our 
own research on HDP mixed-membership models (Wang, Paisley and Blei, 2011). We used the 
same strategy here to develop a scalable inference algorithm for DILN. This further expands 
the scope of stochastic variational inference to models (like DILN) whose latent variables do 
not enjoy pair-wise conjugacy. Using stochastic inference, we analyzed 352,549 articles
from Nature magazine, a corpus which would be computationally expensive with our previous 
variational algorithm. 

1 In this paper we expand on the ideas of Paisley, Wang and Blei (2011), which is a short conference paper. We 
report on new data analysis, we describe a model of the latent component locations that allows for variational 
inference, we improve the variational inference algorithm (see Section 3.4), and we expand it to scale up to very 
large data sets. 

Topic 1: economy, economic, growth, industry, sector, rate, export, production, million, billion
Topic 2: international, nations, republic, agreement, relation, foreign, union, nation, china, economic
Topic 3: party, election, vote, elect, president, democratic, political, win, minister, seat
Topic 4: season, team, win, league, game, championship, align, football, stadium, record
Topic 5: treatment, patient, disease, drug, medical, health, effect, risk, treat, symptom
Topic 6: album, music, band, record, song, rock, release, artist, recording, label
Topic 7: philosophy, philosopher, thing, argument, philosophical, mind, true, truth, reason, existence
Topic 8: law, court, legal, criminal, person, rule, jurisdiction, judge, crime, rights
Topic 9: math, define, function, theorem, element, definition, space, property, theory, sub
Topic 10: church, christian, christ, jesus, catholic, roman, john, god, orthodox, testament
Topic 11: climate, mountain, land, temperature, range, region, dry, south, forest, zone
Topic 12: constitution, parliament, council, appoint, assembly, minister, head, legislative, house
Topic 13: cell, protein, acid, molecule, structure, process, enzyme, dna, membrane, bind
Topic 14: atom, element, chemical, atomic, electron, energy, hydrogen, reaction, sup, sub
Topic 15: computer, memory, processor, design, hardware, machine, unit, chip, ibm, drive
Topic 16: president, congress, washington, governor, republican, john, george, federal, senator, senate
Topic 17: military, army, air, unit, defense, navy, service, operation, armed, personnel
Topic 18: university, student, school, education, college, program, degree, institution, science, graduate
Topic 19: math, value, values, measure, equal, calculate, probability, define, distribution, function
Topic 20: food, meat, drink, fruit, eat, vegetable, water, dish, traditional, ingredient
Topic 21: battle, commander, command, army, troop, victory, attack, british, officer, campaign
Topic 22: sport, ball, team, score, competition, match, player, rule, tournament, event
Topic 23: airport, rail, traffic, road, route, passenger, bus, service, transportation, transport
Topic 24: religion, god, spiritual, religious, belief, teaching, divine, spirit, soul, human
Topic 25: coup, army, military, leader, overthrow, afghanistan, armed, kill, rebel, regime
Topic 26: god, goddess, greek, kill, myth, woman, story, sacrifice, ancient, away
Topic 27: economic, political, argue, society, social, revolution, free, economics, individual, capitalism
Topic 28: radio, service, television, network, station, broadcast, telephone, internet, channel, mobile
Topic 29: equation, math, linear, constant, coordinate, differential, plane, frac, solution, right
Topic 30: university, professor, prize, award, nobel, research, publish, prise, science, society



Fig 1. Topic correlation for a 10K document Wikipedia corpus: The ten most probable words from the 30 most 
probable topics. At top are the positive and negative correlation coefficients for these topics (separated for clarity) 
as learned by the topic locations (see text for details). 



Related research. The parametric model most closely related to DILN is the correlated topic 
model (CTM) (Blei and Lafferty, 2007). The CTM is a mixed-membership model that allows
topic occurrences to exhibit correlation. The CTM replaces the Dirichlet prior over topic propor- 
tions, which assumes near independence of the components, with a logistic normal prior (Aitchi- 
son, 1982). Logistic normal vectors are generated by exponentiating a multivariate Gaussian 
vector and normalizing to form a probability vector. The covariance matrix of the multivariate
Gaussian distribution provides a means for capturing correlation structure between topic prob- 
abilities. Our goal in developing DILN was to form a Bayesian nonparametric variant of this 
kind of model. 

The natural nonparametric extension of the logistic normal is a normalized exponentiated 
Gaussian process (Lenk, 1988; Rasmussen and Williams, 2006). However, this cannot function 
as a prior for nonparametric correlated topic modeling. The key property of the HDP (and DILN) 
is that the same set of components is shared among the groups. This sharing arises because
the group-level distributions on the infinite topic space are discrete probability measures over 
the same set of atoms. Using the model of Lenk (1988) in a hierarchical setting does not provide 
such distributions. The "infinite CTM" is therefore not a viable alternative to the HDP. 

In the Bayesian nonparametric literature, another related line of work focuses on dependent 
probability distributions where the dependence is defined on predictors observed for each data 
point. MacEachern (1999) introduced dependent Dirichlet processes (DDPs), which allow data- 
dependent variation in the atoms of the mixture, and have been applied to spatial modeling 
(Gelfand, Kottas and MacEachern, 2005; Rao and Teh, 2009). Other dependent priors allow the 
mixing weights themselves to vary with predictors (Duan, Guindani and Gelfand, 2007; Dunson 
and Park, 2008; Griffin and Steel, 2006; Ren et al, 2011). Still other methods consider the 
weighting of multiple DP mixture models using spatial information (Dunson, Pillai and Park, 
2007; Muller, Quintana and Rosner, 2004). 

These methods all use the spatial dependence between observations to construct observation- 
specific probability distributions. Thus they condition on known locations (often geospatial) for 
the data. In contrast, the latent locations of each component in DILN do not directly interact 
with the data, but with each other. That is, the correlations induced by these latent locations 
influence the mixing weights for a data group prior to producing its observations in the generative 
process. Unlike DDP models, our observations are not equipped with locations and do not a 
priori influence component probabilities. The modeling ideas behind DILN and behind DDPs 
are separate, though it is possible to develop dependent DILN models, just as dependent HDP 
models have been developed (MacEachern, 1999). 

This paper is organized as follows. In Section 2 we review the HDP and discuss its representa- 
tion as a normalized gamma process. In Section 3 we present the discrete infinite logistic normal 
distribution, first as a scaling of an HDP with an exponentiated Gaussian process and then 
using a normalized gamma construction. In Section 4 we use this gamma construction to derive 
a mean-field variational inference algorithm for approximate posterior inference of DILN topic 
models, and we extend this algorithm to the stochastic variational inference setting. Finally, in 
Section 5 we provide an empirical study of the DILN topic model on five text corpora. 

2. Background: The Hierarchical Dirichlet Process 

The discrete infinite logistic normal (DILN) prior for mixed-membership models is an extension 
of the hierarchical Dirichlet process (HDP) (Teh et al, 2006). In this section, we review the 
HDP and reformulate it as a normalized gamma process. 

2.1. The original formulation of the hierarchical Dirichlet process 

The Dirichlet process (Ferguson, 1973) is useful as a Bayesian nonparametric prior for mixture
models since it generates distributions on infinite parameter spaces that are almost surely
discrete (Blackwell and MacQueen, 1973; Sethuraman, 1994). Given a space $\Omega$ with a
corresponding Borel $\sigma$-algebra $\mathcal{B}$ and base measure $\alpha G_0$, where
$\alpha > 0$ and $G_0$ is a probability measure, Ferguson (1973) proved the existence of a
process $G$ on $(\Omega, \mathcal{B})$ such that for all measurable partitions
$\{B_1, \dots, B_K\}$ of $\Omega$,

$$(G(B_1), \dots, G(B_K)) \sim \mathrm{Dirichlet}(\alpha G_0(B_1), \dots, \alpha G_0(B_K)). \tag{1}$$

This is called a Dirichlet process and is denoted $G \sim \mathrm{DP}(\alpha G_0)$. Sethuraman
(1994) gave a proof of the almost sure (a.s.) discreteness of $G$ by way of a stick-breaking
representation (Ishwaran and James, 2001); we will review this stick-breaking construction later.
Blackwell and MacQueen (1973) gave an earlier proof of this discreteness using Polya urn schemes.
The discreteness of $G$ allows us to write it as

$$G = \sum_{k=1}^{\infty} p_k \delta_{\eta_k},$$

where each atom $\eta_k$ is generated i.i.d. from the base distribution $G_0$, and the atoms are
given random probabilities $p_k$ whose distribution depends on a scaling parameter $\alpha > 0$
such that smaller values of $\alpha$ lead to distributions that place more mass on fewer atoms.
The DP is most commonly used as a prior for a mixture model, where $G_0$ is a distribution on a
model parameter, $G \sim \mathrm{DP}(\alpha G_0)$ and each data point is drawn from a distribution
family indexed by a parameter drawn from $G$ (Ferguson, 1983; Lo, 1984).

When the base measure $G_0$ is non-atomic, multiple draws from the DP prior place their
probability mass on an a.s. disjoint set of atoms. That is, for
$G_1, G_2 \stackrel{iid}{\sim} \mathrm{DP}(\alpha G_0)$, an atom $\eta_k$ in $G_1$ will a.s. not
appear in $G_2$, i.e., $G_1(\{\eta_k\}) > 0 \Rightarrow G_2(\{\eta_k\}) = 0$ a.s. The goal of
mixed-membership modeling is to use all groups of data to learn a shared set of atoms. The
hierarchical Dirichlet process (Teh et al., 2006) was introduced to allow multiple Dirichlet
processes to share the same atoms. The HDP is a prior for a collection of random distributions
$(G_1, \dots, G_M)$. Each $G_m$ is i.i.d. DP distributed with a base probability measure that is
also a Dirichlet process,

$$G \sim \mathrm{DP}(\alpha G_0), \qquad G_m \mid G \stackrel{iid}{\sim} \mathrm{DP}(\beta G). \tag{2}$$

The hierarchical structure of the HDP ensures that each $G_m$ has probability mass distributed
across a shared set of atoms, which results from the a.s. discreteness of the second-level base
measure $\beta G$. Therefore, the same subset of atoms will be used by all groups of data, but with
different probability distributions on these atoms for each group.

Where the DP allows us to define a mixture model, the HDP allows us to define a
mixed-membership model. Given an HDP $(G_1, \dots, G_M)$, each $G_m$ generates its associated
group of data from a mixture model,

$$X_n^{(m)} \mid \theta_n^{(m)} \stackrel{ind}{\sim} f(X \mid \theta_n^{(m)}), \qquad n = 1, \dots, N_m, \tag{3}$$
$$\theta_n^{(m)} \mid G_m \stackrel{iid}{\sim} G_m, \qquad n = 1, \dots, N_m. \tag{4}$$

The datum $X_n^{(m)}$ denotes the $n$th observation in the $m$th group and $\theta_n^{(m)}$
denotes its associated parameter drawn from the mixing distribution $G_m$, with
$\Pr(\theta_n^{(m)} = \eta_k \mid G_m) = G_m(\{\eta_k\})$. The HDP can be defined to an arbitrary
depth, but we focus on the two-level process described above.

When used to model documents, the HDP is a prior for topic models. The observation $X_n^{(m)}$
is the $n$th word in the $m$th document and is drawn from a discrete distribution on words in
a vocabulary, $X_n^{(m)} \mid \theta_n^{(m)} \sim \mathrm{Discrete}(\theta_n^{(m)})$, where
$\theta_n^{(m)}$ is the $V$-dimensional word probability vector selected according to $G_m$ by
its corresponding word. The base probability measure $G_0$ is usually a symmetric Dirichlet
distribution on the vocabulary simplex. Given a document collection, posterior inference yields
a set of shared topics and per-document proportions over all topics. Unlike its finite
counterpart, latent Dirichlet allocation (Blei, Ng and Jordan, 2003), the HDP topic model
determines the number of topics from the data (Teh et al., 2006).

2.2. The HDP as a normalized gamma process

The DP has several representations, including a gamma process representation (Ferguson, 1973) 
and a stick-breaking representation (Sethuraman, 1994). In constructing HDPs, we will take 
advantage of each of these representations at different levels of the hierarchy. 
We construct the top-level DP using stick-breaking (Sethuraman, 1994),

$$G = \sum_{k=1}^{\infty} V_k \prod_{j=1}^{k-1} (1 - V_j)\, \delta_{\eta_k}, \qquad V_k \stackrel{iid}{\sim} \mathrm{Beta}(1, \alpha), \qquad \eta_k \stackrel{iid}{\sim} G_0. \tag{5}$$

The name comes from an interpretation of $V_k$ as the proportion broken from the remainder of a
unit-length stick, $\prod_{j=1}^{k-1} (1 - V_j)$. The resulting absolute length of this stick
forms the probability of atom $\eta_k$. Letting $p_k := V_k \prod_{j=1}^{k-1} (1 - V_j)$, this
method of generating DPs produces probability measures that are size-biased according to index
$k$ since $\mathbb{E}[p_k] > \mathbb{E}[p_j]$ for $k < j$.
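
As a minimal sketch of the stick-breaking construction in Equation (5), the following Python draws the first few weights p_k and averages them over repeated draws to make the size bias visible. The truncation level, the helper name `stick_breaking_weights`, and the value alpha = 2 are assumptions made for illustration; the construction itself is infinite.

```python
import random

def stick_breaking_weights(alpha, truncation, rng):
    """Draw p_1..p_T with p_k = V_k * prod_{j<k} (1 - V_j), V_k ~ Beta(1, alpha)."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)  # proportion broken from the remainder
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining

rng = random.Random(0)
p, leftover = stick_breaking_weights(alpha=2.0, truncation=50, rng=rng)

# Average the first three weights over many draws: E[p_1] > E[p_2] > E[p_3].
num_draws = 5000
mean_p = [0.0, 0.0, 0.0]
for _ in range(num_draws):
    draw, _unused = stick_breaking_weights(2.0, 3, rng)
    for k in range(3):
        mean_p[k] += draw[k] / num_draws
```

For alpha = 2 the three averages should be close to 1/3, 2/9 and 4/27, and the truncated weights plus the unbroken remainder sum to one.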

Turning to the second-level DP $G_m$, we now use a normalized gamma process. Recall that
a $K$-dimensional Dirichlet-distributed vector
$(Y_1, \dots, Y_K) \sim \mathrm{Dirichlet}(c_1, \dots, c_K)$ with $c_i > 0$ and
$\sum_j c_j < \infty$ can be generated for any value of $K$ by drawing
$Z_i \stackrel{ind}{\sim} \mathrm{Gamma}(c_i, 1)$ and defining $Y_i := Z_i / \sum_j Z_j$
(Ishwaran and Zarepour, 2002). Ferguson (1973) focused on the infinite extension of this
representation as a normalized gamma process. Since $p_k > 0$ for all atoms $\eta_k$ in $G$, and
also because $\sum_{j=1}^{\infty} \beta p_j = \beta < \infty$, we can construct each $G_m$ using
the following normalization of a gamma process,

$$G_m \mid G, Z = \sum_{k=1}^{\infty} \frac{Z_k^{(m)}}{\sum_{j=1}^{\infty} Z_j^{(m)}}\, \delta_{\eta_k}, \qquad Z_k^{(m)} \mid G \stackrel{ind}{\sim} \mathrm{Gamma}(\beta p_k, 1). \tag{6}$$

The gamma process representation of the DP is discussed by Ferguson (1973), Kingman (1993) 
and Ishwaran and Zarepour (2002), but it has not been applied to the HDP. In DILN we will 
mirror this type of construction of the HDP — a stick-breaking construction for the top-level DP 
and a gamma process construction for the second-level DPs. This will let us better articulate 
model properties and also make inference easier. 
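
The normalization in Equation (6) can be sketched for a finite set of atoms as follows. The top-level weights `top_level_p`, the value of beta, and the function name `hdp_group_weights` are hypothetical choices made for illustration; with finitely many atoms the normalized gamma draws are exactly Dirichlet(beta * p) distributed, so their average should recover the top-level weights.

```python
import random

def hdp_group_weights(top_level_p, beta, rng):
    """Normalize independent Gamma(beta * p_k, 1) draws (finite version of Equation 6)."""
    z = [rng.gammavariate(beta * pk, 1.0) for pk in top_level_p]
    total = sum(z)
    return [zk / total for zk in z]

rng = random.Random(1)
top_level_p = [0.5, 0.3, 0.2]  # hypothetical top-level weights, fixed rather than drawn
beta = 5.0                     # hypothetical second-level concentration

# Each group gets its own weights on the same atoms; on average they match top_level_p.
num_groups = 4000
mean_w = [0.0] * len(top_level_p)
for _ in range(num_groups):
    w = hdp_group_weights(top_level_p, beta, rng)
    for k, wk in enumerate(w):
        mean_w[k] += wk / num_groups
```

Every group shares the same three atoms but receives different probabilities on them, which is the sharing property the HDP is built to provide.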

3. The Discrete Infinite Logistic Normal Distribution 

The HDP prior has the hidden assumption that the presence of one atom in a group is not a priori
correlated with the presence of another atom (aside from the negative correlation imposed by the
probability simplex). At the group level the HDP cannot model correlation structure between the
components' probability mass. To see this, note that the gamma process used to construct each
group-level distribution is an example of a completely random measure (Kingman, 1993). That
is, the unnormalized masses $(Z_1^{(m)}, Z_2^{(m)}, \dots)$ of the atoms
$(\eta_1, \eta_2, \dots)$ of $G_m$ are independently drawn, and for all partitions
$\{B_1, \dots, B_K\}$ of $\Omega$ and given $S_m := \sum_j Z_j^{(m)}$, the scaled random
variables $S_m G_m(B_1), \dots, S_m G_m(B_K)$ are independent. Thus, no correlation between
per-group probabilities can be built into the HDP.

We introduced the discrete infinite logistic normal (DILN) as a modification of the HDP that 
can express such correlations (Paisley, Wang and Blei, 2011). The idea is that each atom lives in 
a latent location, and the correlation between atom probabilities is determined by their relative 
locations in the latent space. When analyzing data, modeling these correlations can improve 
the predictive distribution and provide more information about the underlying latent structure. 
DILN has two equivalent representations; we first describe it as a scaled HDP, with scaling 
determined by an exponentiated Gaussian process (Rasmussen and Williams, 2006). We then 
show how DILN fits naturally within the family of normalized gamma constructions of discrete 
probability distributions in a way similar to the discussion in Section 2.2 for the HDP. 

[Figure 2: graphical model. Legend entries, in order: mean and kernel functions for GP; a draw from GP(m,K); concentration parameters; top-level stick-breaking proportions; atom proportions from gamma process; topic and its location; base distribution for topics and locations; topic index for words; observed words.]

Fig 2. A graphical model of the normalized gamma construction of the DILN topic model.



3.1. DILN as a scaled HDP 



DILN shares the same hierarchical structure described in Section 2.2 for the HDP — there is an 
infinite set of components and each group exhibits those components with different probabilities. 
In DILN, we further associate each component with a latent location in $\mathbb{R}^d$. (The dimension $d$
is predefined.) The model then uses these locations to influence the correlations between the 
probabilities of the components for each group-level distribution. In posterior inference, we infer 
both the components and their latent locations. Thus, through the inferred locations, we can 
estimate the correlation structure among the components. 

Let $G_0$ be a base distribution over parameter values $\eta \in \Omega$, and let $L_0$ be a
non-atomic base distribution over locations, $\ell \in \mathbb{R}^d$. We first draw a top-level
Dirichlet process with a product base measure $\alpha G_0 \times L_0$,

$$G \sim \mathrm{DP}(\alpha G_0 \times L_0). \tag{7}$$

Here, $G$ is a probability measure on the space $\Omega \times \mathbb{R}^d$. For each atom
$\{\eta, \ell\} \in \Omega \times \mathbb{R}^d$, we think of $\eta \in \Omega$ as living in the
parameter space, and $\ell \in \mathbb{R}^d$ as living in the location space.

In the second level of the process, the model uses both the probability measure $G$ and the
locations of the atoms to construct group-level probability distributions. This occurs in two
steps. In the first step, we independently draw a Dirichlet process and a Gaussian process using
the measure and atoms of $G$,

$$G_m^{DP} \mid G \sim \mathrm{DP}(\beta G), \qquad W_m(\ell) \sim \mathrm{GP}(\mu(\ell), K(\ell, \ell')). \tag{8}$$

The Dirichlet process $G_m^{DP}$ provides a new, initial distribution on the atoms of $G$ for
group $m$. The Gaussian process $W_m$ is defined on the locations of the atoms of $G$ and results
in a random function that can be evaluated using the location of each atom. The covariance
between $W_m(\ell)$ and $W_m(\ell')$ is determined by a kernel function $K(\ell, \ell')$ on their
respective locations.

The second step is to form each group-level distribution by scaling the probabilities of each
second-level Dirichlet process by the exponentiated values of its corresponding Gaussian process,

$$G_m(\{\eta, \ell\}) \mid G_m^{DP}, W_m \propto G_m^{DP}(\{\eta, \ell\}) \exp\{W_m(\ell)\}. \tag{9}$$

Since we define $G_0$ and $L_0$ to be non-atomic, all $\eta$ and $\ell$ in $G$ are a.s. distinct,
and evaluating the Gaussian process $W_m$ at a location $\ell$ determines its atom
$\{\eta, \ell\}$. We satisfy two objectives with this representation: (i) the probability measure
$G_m$ is discrete, owing to the discreteness of $G_m^{DP}$, and (ii) the probabilities in $G_m$
are explicitly correlated, due to the exponentiated Gaussian process. We emphasize that these
correlations arise from latent locations and in posterior inference we infer these locations
from data.



3.2. A normalized gamma construction of DILN 



We now turn to a normalized gamma construction of DILN. We show that the DILN prior
uses the second parameter of the gamma distribution in the normalized gamma construction of
the HDP to model the covariance structure among the components of $G_m$. This representation
facilitates approximate posterior inference described in Section 4, and helps clarify the
covariance properties of the group-level distributions over atoms.

We use a stick-breaking construction of the top-level Dirichlet process (Equation 7),

$$G = \sum_{k=1}^{\infty} V_k \prod_{j=1}^{k-1} (1 - V_j)\, \delta_{\{\eta_k, \ell_k\}}, \qquad V_k \stackrel{iid}{\sim} \mathrm{Beta}(1, \alpha), \qquad \eta_k \stackrel{iid}{\sim} G_0, \qquad \ell_k \stackrel{iid}{\sim} L_0. \tag{10}$$

This is nearly the same as the top-level construction of the HDP given in Equation (5). The
difference is that the product base measure is defined over the latent location $\ell_k$ as well
as the component $\eta_k$ to form the atom $\{\eta_k, \ell_k\}$.

We pattern the group-level distributions after the gamma process construction of the
second-level DP in the HDP,

$$G_m \mid G, Z = \sum_{k=1}^{\infty} \frac{Z_k^{(m)}}{\sum_{j=1}^{\infty} Z_j^{(m)}}\, \delta_{\{\eta_k, \ell_k\}}, \tag{11}$$
$$Z_k^{(m)} \mid G, W_m \sim \mathrm{Gamma}(\beta p_k, \exp\{-W_m(\ell_k)\}), \qquad W_m \mid G \sim \mathrm{GP}(\mu(\ell), K(\ell, \ell')),$$

with $p_k := V_k \prod_{j=1}^{k-1} (1 - V_j)$. Here, DILN differs from the HDP in that it uses
the second parameter of the gamma distribution. In the appendix, we give a proof that the
normalizing constant is almost surely finite.

We note that the locations $\ell_k$ contained in each atom no longer serve a function in the
model after $G_m$ is constructed, but we include them in Equation (11) to be technically correct.
The purpose of the locations $\{\ell_k\}$ is to generate sequences
$Z_1^{(m)}, Z_2^{(m)}, \dots$ that are correlated, which is not achieved by the HDP. After
constructing the weights of $G_m$, the locations have fulfilled their role and are no longer used
downstream by the model.
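
A truncated sketch of the group-level draw in Equation (11) is given below. It assumes, for illustration only, a dot-product kernel K(l, l') = l . l', so that a zero-mean Gaussian process draw at the atom locations can be written as W(l_k) = l_k . u with u ~ Normal(0, I_d); the top-level weights, the locations, beta, and the function name `diln_group_weights` are all hypothetical. Note that Python's `random.gammavariate` takes a shape and a scale, so a rate of exp{-W} is passed as the scale exp{W}.

```python
import math
import random

def diln_group_weights(p, locations, beta, rng):
    """Truncated sketch of Equation (11) under an assumed dot-product kernel."""
    d = len(locations[0])
    # W(l_k) = l_k . u with u ~ Normal(0, I_d): zero mean, Cov(W_k, W_j) = l_k . l_j.
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    w = [sum(a * b for a, b in zip(lk, u)) for lk in locations]
    # Z_k ~ Gamma(shape = beta * p_k, rate = exp(-W_k)), i.e. scale = exp(W_k).
    z = [rng.gammavariate(beta * pk, math.exp(wk)) for pk, wk in zip(p, w)]
    total = sum(z)
    return [zk / total for zk in z]

rng = random.Random(2)
p = [0.4, 0.3, 0.2, 0.1]  # hypothetical top-level weights
# Hypothetical locations in R^2: atoms 0 and 1 are close, atom 2 points the opposite way.
locations = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]]
weights = diln_group_weights(p, locations, beta=5.0, rng=rng)
```

Under these assumed locations, atoms 0 and 1 receive nearly the same Gaussian process value and so tend to be scaled up or down together, while atom 2 is scaled in the opposite direction.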

We derive Equation (11) using a basic property of gamma distributed random variables.
Recall that the gamma density is $f(z \mid a, b) = b^a z^{a-1} \exp\{-bz\} / \Gamma(a)$. Consider
a random variable $y \sim \mathrm{Gamma}(a, 1)$ that is scaled by $b > 0$ to produce $z = by$.
Then $z \sim \mathrm{Gamma}(a, b^{-1})$. In Equation (9) we scale atom $\{\eta, \ell\}$ of the
Dirichlet process $G_m^{DP}$ by $\exp\{W_m(\ell)\}$. Using the gamma process representation of
$G_m^{DP}$ given in Equation (6) and the countably infinite $G$ in Equation (10), we have that
$G_m(\{\eta_k, \ell_k\}) \propto Y_k^{(m)} \exp\{W_m(\ell_k)\}$, where
$Y_k^{(m)} \sim \mathrm{Gamma}(\beta p_k, 1)$. Since
$Z_k^{(m)} := Y_k^{(m)} \exp\{W_m(\ell_k)\}$ is distributed as
$\mathrm{Gamma}(\beta p_k, \exp\{-W_m(\ell_k)\})$ by the above property of scaled gamma random
variables, the construction in Equation (11) follows.
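
The scaling property used here follows from a change of variables. As a short worked step, write $f_Y$ for the Gamma(a, 1) density in the rate parameterization given above and let $z = by$ with $b > 0$:

```latex
\begin{align*}
f_Z(z) &= \frac{1}{b}\, f_Y\!\left(\frac{z}{b}\right)
        = \frac{1}{b} \cdot \frac{(z/b)^{a-1} \exp\{-z/b\}}{\Gamma(a)} \\
       &= \frac{(b^{-1})^{a}\, z^{a-1} \exp\{-b^{-1} z\}}{\Gamma(a)}
        = f(z \mid a, b^{-1}).
\end{align*}
```

Setting $b = \exp\{W_m(\ell_k)\}$ and $a = \beta p_k$ gives the rate $\exp\{-W_m(\ell_k)\}$ that appears in Equation (11).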

For the topic model, drawing an observation proceeds as for the HDP. We use a latent indicator
variable $C_n^{(m)}$ which selects the index of the atom used by observation $X_n^{(m)}$. This
indicator variable gives a useful hidden-data representation of the process for inference in
mixture models (Escobar and West, 1995),

$$X_n^{(m)} \mid G_m, C_n^{(m)} \sim \mathrm{Discrete}(\eta_{C_n^{(m)}}), \qquad C_n^{(m)} \mid G_m \stackrel{iid}{\sim} \sum_{k=1}^{\infty} \frac{Z_k^{(m)}}{\sum_{j=1}^{\infty} Z_j^{(m)}}\, \delta_k, \tag{12}$$

where the discrete distribution is on word index values $\{1, \dots, V\}$. We note that this
discrete distribution is one of many possible data generating distributions, and changing this
distribution and $G_0$ will allow for DILN to be used in a variety of other mixed-membership
modeling applications (Airoldi et al., 2008; Erosheva, Fienberg and Joutard, 2007; Pritchard,
Stephens and Donnelly, 2000). Figure 2 shows the graphical model of the DILN topic model.
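
The word-level draw in Equation (12) can be sketched as a two-step sample: first the indicator from the group's normalized weights, then the word from the selected topic. The weights, the three toy topics over a five-word vocabulary, and the function name `draw_document` are hypothetical values chosen for illustration; word indices here start at 0 rather than 1.

```python
import random

def draw_document(weights, topics, num_words, rng):
    """Draw indicators C_n from the group weights, then words X_n ~ Discrete(eta_{C_n})."""
    vocab = range(len(topics[0]))
    words, indicators = [], []
    for _ in range(num_words):
        c = rng.choices(range(len(weights)), weights=weights, k=1)[0]  # topic index
        x = rng.choices(vocab, weights=topics[c], k=1)[0]              # word index
        indicators.append(c)
        words.append(x)
    return words, indicators

rng = random.Random(5)
weights = [0.7, 0.2, 0.1]          # hypothetical normalized group-level weights
topics = [
    [0.5, 0.5, 0.0, 0.0, 0.0],     # hypothetical topic 0: words 0 and 1 only
    [0.0, 0.0, 0.5, 0.5, 0.0],     # hypothetical topic 1: words 2 and 3 only
    [0.0, 0.0, 0.0, 0.0, 1.0],     # hypothetical topic 2: word 4 only
]
words, indicators = draw_document(weights, topics, num_words=20, rng=rng)
```

Because the toy topics have disjoint supports, each drawn word identifies the topic that generated it, which makes the role of the indicator easy to inspect.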

3.3. The covariance structure of DILN



The two-parameter gamma representation of DILN permits simple calculation of the expectation,
variance and covariance prior to normalization. We first give these values conditioning on the
top-level Dirichlet process $G$ and integrating out the Gaussian process $W_m$. In the following
calculations, we assume that the mean function of the Gaussian process is $\mu(\cdot) = 0$ and we
define $k_{ij} := K(\ell_i, \ell_j)$. The expectation, variance and covariance of $Z_i^{(m)}$ and
$Z_j^{(m)}$ are

$$\begin{aligned}
\mathbb{E}\big[Z_i^{(m)} \mid \beta, p, K\big] &= \beta p_i\, e^{\frac{1}{2} k_{ii}}, \\
\mathbb{V}\big[Z_i^{(m)} \mid \beta, p, K\big] &= \beta p_i\, e^{2 k_{ii}} + \beta^2 p_i^2\, e^{k_{ii}} \big(e^{k_{ii}} - 1\big), \\
\mathrm{Cov}\big[Z_i^{(m)}, Z_j^{(m)} \mid \beta, p, K\big] &= \beta^2 p_i p_j\, e^{\frac{1}{2}(k_{ii} + k_{jj})} \big(e^{k_{ij}} - 1\big).
\end{aligned} \tag{13}$$

Observe that the covariance is similar to the unnormalized logistic normal (Aitchison, 1982),
but with the additional term $\beta^2 p_i p_j$. In general, these $p_i$ terms show how sparsity is
enforced by the top-level DP, since both the expectation and variance terms go to zero
exponentially fast as $i$ increases.
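
As a hedged numerical check of Equation (13), the sketch below compares the closed forms with Monte Carlo estimates for two components. The values of beta, p and the 2x2 kernel matrix are hypothetical, as are the function names; the Gaussian process values are drawn with a hand-written Cholesky factor of the kernel matrix, and the rate exp{-W} is passed to `random.gammavariate` as the scale exp{W}.

```python
import math
import random

def diln_moments(beta, p, k):
    """Closed forms of Equation (13) for components 0 and 1 (zero-mean GP)."""
    mean0 = beta * p[0] * math.exp(k[0][0] / 2.0)
    var0 = (beta * p[0] * math.exp(2.0 * k[0][0])
            + beta ** 2 * p[0] ** 2 * math.exp(k[0][0]) * (math.exp(k[0][0]) - 1.0))
    cov01 = (beta ** 2 * p[0] * p[1]
             * math.exp((k[0][0] + k[1][1]) / 2.0) * (math.exp(k[0][1]) - 1.0))
    return mean0, var0, cov01

def simulate(beta, p, k, num_draws, rng):
    """Monte Carlo estimates of the same mean, variance and covariance."""
    a = math.sqrt(k[0][0])          # Cholesky factor of the 2x2 kernel matrix
    b = k[0][1] / a
    c = math.sqrt(k[1][1] - b * b)
    z0s, z1s = [], []
    for _ in range(num_draws):
        e0, e1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        w0, w1 = a * e0, b * e0 + c * e1          # (w0, w1) ~ Normal(0, k)
        z0s.append(rng.gammavariate(beta * p[0], math.exp(w0)))
        z1s.append(rng.gammavariate(beta * p[1], math.exp(w1)))
    m0 = sum(z0s) / num_draws
    m1 = sum(z1s) / num_draws
    v0 = sum((z - m0) ** 2 for z in z0s) / num_draws
    c01 = sum((x - m0) * (y - m1) for x, y in zip(z0s, z1s)) / num_draws
    return m0, v0, c01

beta, p = 2.0, [0.6, 0.4]        # hypothetical values
k = [[0.2, 0.1], [0.1, 0.2]]     # hypothetical kernel matrix
exact = diln_moments(beta, p, k)
approx = simulate(beta, p, k, num_draws=100000, rng=random.Random(3))
```

With a positive off-diagonal kernel entry the covariance between the two unnormalized masses is positive, and it vanishes when that entry is zero.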

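These values can also be calculated with the top-level Dirichlet process integrated out using
the tower property of conditional expectation. They are

$$\begin{aligned}
\mathbb{E}\big[Z_i^{(m)} \mid \alpha, \beta, K\big] &= \beta\, \mathbb{E}[p_i]\, e^{\frac{1}{2} k_{ii}}, \\
\mathbb{V}\big[Z_i^{(m)} \mid \alpha, \beta, K\big] &= \beta\, \mathbb{E}[p_i]\, e^{2 k_{ii}} + \beta^2\, \mathbb{E}[p_i^2]\, e^{2 k_{ii}} - \beta^2\, \mathbb{E}[p_i]^2\, e^{k_{ii}}, \\
\mathrm{Cov}\big[Z_i^{(m)}, Z_j^{(m)} \mid \alpha, \beta, K\big] &= \beta^2\, \mathbb{E}[p_i p_j]\, e^{\frac{1}{2}(k_{ii} + k_{jj}) + k_{ij}} - \beta^2\, \mathbb{E}[p_i]\, \mathbb{E}[p_j]\, e^{\frac{1}{2}(k_{ii} + k_{jj})}.
\end{aligned} \tag{14}$$

The values of the expectations in Equation (14) are

$$\mathbb{E}[p_i] = \frac{\alpha^{i-1}}{(1 + \alpha)^{i}}, \qquad \mathbb{E}[p_i^2] = \frac{2 \alpha^{i-1}}{(1 + \alpha)(2 + \alpha)^{i}}, \qquad \mathbb{E}[p_i p_j] = \frac{\alpha^{i-1}}{(2 + \alpha)^{j} (1 + \alpha)^{i-j+1}}, \quad i > j.$$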



Note that some covariance remains when $k_{ij} = 0$, since the conditional independence induced
by $p$ is no longer present. The available covariance structure depends on the kernel. For
example, when a Gaussian kernel is used, a structured negative covariance is not achievable since
$k_{ij} > 0$. We next discuss one possible kernel function, which we will use in our inference
algorithm and experiments.
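
The stick-breaking moments that enter Equation (14) can be checked numerically. The sketch below encodes E[p_i] = alpha^(i-1) / (1+alpha)^i, E[p_i^2] = 2 alpha^(i-1) / ((1+alpha)(2+alpha)^i) and, for i > j, E[p_i p_j] = alpha^(i-1) / ((2+alpha)^j (1+alpha)^(i-j+1)), and compares them with simulated stick-breaking draws. The function names and the choice alpha = 2, i = 3, j = 2 are assumptions made for illustration.

```python
import random

def expected_p(alpha, i):
    return alpha ** (i - 1) / (1.0 + alpha) ** i

def expected_p_squared(alpha, i):
    return 2.0 * alpha ** (i - 1) / ((1.0 + alpha) * (2.0 + alpha) ** i)

def expected_p_product(alpha, i, j):
    """E[p_i p_j] for i > j."""
    return alpha ** (i - 1) / ((2.0 + alpha) ** j * (1.0 + alpha) ** (i - j + 1))

def simulate_moments(alpha, i, j, num_draws, rng):
    """Monte Carlo estimates of E[p_i], E[p_i^2] and E[p_i p_j] (1-based indices)."""
    s1 = s2 = s12 = 0.0
    for _ in range(num_draws):
        remaining, p = 1.0, []
        for _k in range(i):
            v = rng.betavariate(1.0, alpha)   # V_k ~ Beta(1, alpha)
            p.append(v * remaining)
            remaining *= 1.0 - v
        s1 += p[i - 1]
        s2 += p[i - 1] ** 2
        s12 += p[i - 1] * p[j - 1]
    return s1 / num_draws, s2 / num_draws, s12 / num_draws

alpha, i, j = 2.0, 3, 2
exact = (expected_p(alpha, i), expected_p_squared(alpha, i), expected_p_product(alpha, i, j))
approx = simulate_moments(alpha, i, j, num_draws=20000, rng=random.Random(6))
```

All three moments decay geometrically in the index, which is the sparsity effect of the top-level DP noted above.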



3.4. Learning the kernel for DILN

In our formulation of DILN, we have left the kernel function undefined. In principle, any kernel 
function can be used, but in practice some kernels yield simpler inference algorithms than others. 
For example, while a natural choice for $K(\ell, \ell')$ is the Gaussian kernel, we found that the
resulting variational inference algorithm was computationally expensive because it required many
matrix inversions to infer the latent locations.² In this section, we define an alternative kernel.
In the next section, we will see that this leads to simple algorithms for approximate inference of
the latent locations $\ell$.

² In Paisley, Wang and Blei (2011) we side-stepped this issue by learning a point estimate of the matrix $K$,
which was finite following a truncated approximation introduced for variational inference. We suggested finding
locations by using an eigendecomposition of the learned $K$. The approach outlined here is more rigorous in that
it stays closer to the model and is not tied to a particular approximate inference approach.

We model the location of a component with a zero-mean Gaussian vector in $\mathbb{R}^d$. We then
form the kernel by taking the dot product of these vectors. That is, for components $k$ and $j$, we
draw locations and parameterize the Gaussian process for $W_m$ as

$$ \ell_k \sim \mathrm{Normal}(0, c I_d), \qquad \mu(\ell_k) = 0, \qquad K(\ell_k, \ell_j) = \ell_k^T \ell_j. \tag{15} $$

With this specification, all $p$-dimensional ($p < d$) sub-matrices of $K$ are Wishart-distributed
with parameters $p$ and $c I_p$ (Dawid, 1981). However, this kernel is problematic. When the number
of components $p$ is greater than $d$, it will produce singular covariance matrices that cannot be
inverted in the Gaussian likelihood function of $W_m$, an inversion that is required during inference.
While in parametric models we might place constraints on the number of components, our prior
is nonparametric. We have an infinite number of components and therefore $K$ must be singular.

We solve this problem by forming an equivalent representation of the kernel in Equation (15)
that yields a more tractable joint likelihood function. This representation uses auxiliary variables
as follows. Let $u \sim \mathrm{Normal}(0, I_d)$ and recall that for a vector $z = B^T u$, the marginal
distribution of $z$ is $z \mid B \sim \mathrm{Normal}(0, B^T B)$. In our case, $B^T B$ is the inner product
kernel and the columns of $B$ correspond to component locations, $B = [\ell_1, \ell_2, \cdots]$.

With this in mind, we use the following construction of the Gaussian process $W_m$,

$$ W_m(\ell_k) = \ell_k^T u_m, \qquad u_m \sim \mathrm{Normal}(0, I_d). \tag{16} $$

Marginalizing the auxiliary vector $u_m$ gives the desired $W_m(\ell) \sim \mathrm{GP}(0, K(\ell, \ell'))$.

The auxiliary vector $u_m$ allows for tractable inference of Gaussian processes that lie in a
low-dimensional subspace. Aside from analytical tractability, the vector $u_m$ can be interpreted
as a location for group $m$. (This is not to be confused with the location of component $k$, $\ell_k$.)
The group locations let us measure similarity between groups, such as document similarity in
the topic modeling case. In the following sections, we no longer work directly with $W_m(\ell_k)$, but
rather the dot product $\ell_k^T u_m$ through inference of $\ell$ and $u$.
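
To make the construction in Equation (16) concrete, the following sketch draws $W_m(\ell_k) = \ell_k^T u_m$ for many independent auxiliary vectors and compares the empirical second moments with the inner product kernel. The three two-dimensional locations, the seed and the number of draws are arbitrary choices made for this illustration, not values from our experiments.

```python
import random


def dot(x, y):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))


def sample_gp_values(locations, rng):
    """Draw u ~ Normal(0, I_d) and return W(l_k) = l_k^T u for every location."""
    d = len(locations[0])
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return [dot(loc, u) for loc in locations]


def empirical_second_moments(locations, n_draws, seed=0):
    """Monte Carlo estimate of E[W(l_a) W(l_b)]; W has zero mean, so this is the covariance."""
    rng = random.Random(seed)
    t = len(locations)
    acc = [[0.0] * t for _ in range(t)]
    for _ in range(n_draws):
        w = sample_gp_values(locations, rng)
        for a in range(t):
            for b in range(t):
                acc[a][b] += w[a] * w[b]
    return [[v / n_draws for v in row] for row in acc]


locations = [[1.0, 0.0], [0.5, 0.5], [-1.0, 0.5]]   # hypothetical component locations
kernel = [[dot(a, b) for b in locations] for a in locations]
estimate = empirical_second_moments(locations, 20000)
```

Because the kernel here is a dot product, entries such as `kernel[0][2]` can be negative, so negative covariance between components is representable, in contrast with the Gaussian kernel.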

4. Variational Inference for DILN 

In Bayesian nonparametric mixed-membership modeling, the central computational problem is
posterior inference. However, computing the exact posterior is intractable. For HDP-based
models, researchers have developed several approximate methods (Liang et al, 2007; Teh, Kurihara
and Welling, 2009; Teh et al, 2006; Wang, Paisley and Blei, 2011).

In this paper, we derive a mean-field variational inference algorithm (Jordan et al, 1999;
Wainwright and Jordan, 2008) to approximate the posterior of a DILN mixed-membership model. We
focus on topic modeling but note that our algorithm can be applied (with a little modification)
to any DILN mixed-membership model. In addition, since the HDP is an instance of DILN, this
algorithm also provides an inference method for HDP mixed-membership models.

Variational methods for approximate posterior inference attempt to minimize the
Kullback-Leibler divergence between a factorized distribution over the hidden variables and the true
posterior. The hidden variables in the DILN topic model can be broken into document-level
variables (those defined for each document), and corpus-level variables (those defined across
documents); the document-level variables are the unnormalized weights $Z_k^{(m)}$, topic indexes
$C_n^{(m)}$, and document locations $u_m$; the corpus-level variables are the topic distributions
$\eta_k$, proportions $V_k$, concentration parameters $\alpha$ and $\beta$, and topic locations
$\ell_k$. Under the mean-field assumption the variational distribution that approximates the full
posterior is factorized,

$$
Q := q(\alpha)\, q(\beta) \prod_{k=1}^{T} q(\eta_k)\, q(V_k)\, q(\ell_k) \prod_{m=1}^{M} \Big[ q(u_m) \prod_{k=1}^{T} q(Z_k^{(m)}) \prod_{n=1}^{N_m} q(C_n^{(m)}) \Big]. \tag{17}
$$

Algorithm 1 Batch variational Bayes for DILN
Batch optimization of the variational lower bound $\mathcal{L}$
Optimize corpus-wide and document-specific variational parameters $\Psi'$ and $\Psi_m$
1: while $\Psi'$ and $\Psi_m$ have not converged do
2:    for $m = 1, \ldots, M$ do
3:       Optimize $\Psi_m$ (Equations 22-24)
4:    end for
5:    Optimize $\Psi'$ (Equations 25-29)
6: end while


We select the following variational distributions for each latent variable,

$$
\begin{aligned}
q(C_n^{(m)}) &= \mathrm{Multinomial}\big(C_n^{(m)} \mid \phi_n^{(m)}\big), \\
q(Z_k^{(m)}) &= \mathrm{Gamma}\big(Z_k^{(m)} \mid a_k^{(m)}, b_k^{(m)}\big), \\
q(\eta_k) &= \mathrm{Dirichlet}\big(\eta_k \mid \gamma_{k,1}, \ldots, \gamma_{k,D}\big), \\
q(\ell_k)\, q(u_m) &= \delta_{\hat{\ell}_k} \cdot \delta_{\hat{u}_m}, \\
q(V_k) &= \delta_{\hat{V}_k}, \\
q(\alpha)\, q(\beta) &= \delta_{\hat{\alpha}} \cdot \delta_{\hat{\beta}}.
\end{aligned}
\tag{18}
$$

The set of parameters to these distributions are the variational parameters, represented by $\Psi$.
The goal of variational inference is to optimize these parameters to make the distribution
$Q$ close in KL divergence to the true posterior. Minimizing this divergence is equivalent to
maximizing a lower bound on the log marginal likelihood obtained from Jensen's inequality,



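$$ \ln \int p(X, \Theta)\, d\Theta \;\geq\; \int Q(\Psi) \ln \frac{p(X, \Theta)}{Q(\Psi)}\, d\Theta, \tag{19} $$

where $\Theta$ stands for all hidden random variables. This objective has the form

$$ \mathcal{L}(X, \Psi) = \mathbb{E}_Q[\ln p(X, \Theta)] + \mathbb{H}[Q]. \tag{20} $$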

We will find a locally optimal solution of this function using coordinate ascent, as detailed in 
the next section. 

Note that we truncate the number of components at T in the top-level Dirichlet process 
(Blei and Jordan, 2005). Kurihara, Welling and Vlassis (2006) show how infinite-dimensional 
objective functions can be defined for variational inference, but the conditions for this are not met 
by DILN. The truncation level T should be set larger than the total number of topics expected 
to be used by the data. A value of T that is set too small is easy to diagnose: the approximate 
posterior will use all T topics. Setting T large enough, the variational approximation will prefer 
a corpus-wide distribution on topics that is sparse. We contrast this with the CTM and other 
finite topic models, which fit a pre-specified number of topics to the data and potentially overfit 
if that number is too large. 

We have selected several delta functions as variational distributions. In the case of the
top-level stick-breaking proportions and second-level concentration parameter $\beta$, we have followed
Liang et al. (2007) in doing this for tractability. In the case of the top-level concentration
parameter $\alpha$, and topic and document locations $\ell_k$ and $u_m$, these choices simplify the algorithm.

4.1. Coordinate ascent variational inference

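We now present the variational inference algorithm for the DILN topic model. We optimize the
variational parameters $\Psi$ with respect to the variational objective function of Equation (20).
For DILN, the variational objective expands to

$$
\begin{aligned}
\mathcal{L}(X, \Psi) ={}& \sum_{m=1}^{M} \sum_{n=1}^{N_m} \sum_{k=1}^{T} \phi_{n,k}^{(m)}\, \mathbb{E}_Q\big[\ln p(X_n^{(m)} \mid \eta_k)\big]
 + \sum_{m=1}^{M} \sum_{n=1}^{N_m} \sum_{k=1}^{T} \phi_{n,k}^{(m)}\, \mathbb{E}_Q\big[\ln p(C_n^{(m)} = k \mid Z_{1:T}^{(m)})\big] \\
& + \sum_{m=1}^{M} \sum_{k=1}^{T} \mathbb{E}_Q\big[\ln p(Z_k^{(m)} \mid \beta p_k, \ell_k, u_m)\big]
 + \sum_{k=1}^{T} \mathbb{E}_Q\big[\ln p(\eta_k \mid \gamma_0)\big]
 + \sum_{k=1}^{T} \mathbb{E}_Q\big[\ln p(V_k \mid \alpha)\big] \\
& + \sum_{k=1}^{T} \mathbb{E}_Q\big[\ln p(\ell_k)\big]
 + \sum_{m=1}^{M} \mathbb{E}_Q\big[\ln p(u_m)\big]
 + \mathbb{E}_Q[\ln p(\alpha)] + \mathbb{E}_Q[\ln p(\beta)] - \mathbb{E}_Q[\ln Q].
\end{aligned}
\tag{21}
$$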

We use coordinate ascent to optimize this function, iterating between two steps. In the first step 
we optimize the document-level parameters for each document; in the second step we optimize 
the corpus-level parameters. Algorithm 1 summarizes this general inference structure. 



Document-level parameters

For each document, we iterate between updating the variational distribution of per-word topic
indicators $C_n^{(m)}$, unnormalized weights $Z_k^{(m)}$, and document locations $u_m$.

Coordinate update of $q(C_n^{(m)})$. The variational distribution on the topic index for word
$X_n^{(m)}$ is multinomial with parameter $\phi_n^{(m)}$. For $k = 1, \ldots, T$ topics

$$ \phi_{n,k}^{(m)} \propto \exp\Big\{ \mathbb{E}_Q\big[\ln \eta_k(X_n^{(m)})\big] + \mathbb{E}_Q\big[\ln Z_k^{(m)}\big] \Big\}. \tag{22} $$

Since $\phi_n^{(m)} = \phi_{n'}^{(m)}$ when $X_n^{(m)} = X_{n'}^{(m)}$, we only need to compute this
update once for each unique word occurring in document $m$.
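
The update in Equation (22) amounts to a softmax over topics, computed once per unique word. The sketch below is one possible implementation under the assumption that the two expectations are already available as plain lists; the function names and the toy inputs are hypothetical.

```python
import math


def update_phi(e_log_eta_word, e_log_z):
    """Normalized phi over topics for one word type.

    e_log_eta_word[k] = E_Q[ln eta_k(word)], e_log_z[k] = E_Q[ln Z_k].
    The maximum score is subtracted before exponentiating for numerical stability.
    """
    scores = [a + b for a, b in zip(e_log_eta_word, e_log_z)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def update_phi_document(word_ids, e_log_eta, e_log_z):
    """Compute phi once per unique word id; e_log_eta[k][d] = E_Q[ln eta_k(d)]."""
    cache = {}
    for d in set(word_ids):
        cache[d] = update_phi([row[d] for row in e_log_eta], e_log_z)
    return [cache[d] for d in word_ids]


# Toy example: two topics, two vocabulary words, a three-word document.
e_log_eta = [[math.log(0.5), math.log(0.5)],
             [math.log(0.9), math.log(0.1)]]
phi = update_phi_document([0, 1, 0], e_log_eta, [0.0, 0.0])
```

Repeated words share the same cached vector, which mirrors the observation that the update only needs to be computed once per unique word.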

Coordinate update of $q(Z_k^{(m)})$. This variational gamma distribution has parameters $a_k^{(m)}$
and $b_k^{(m)}$. Let $N_m$ be the number of observations (e.g., words) in group $m$. After introducing an
auxiliary parameter $\xi_m$ for each group-level distribution (discussed below), the updates are

$$
a_k^{(m)} = \beta p_k + \sum_{n=1}^{N_m} \phi_{n,k}^{(m)}, \qquad
b_k^{(m)} = \exp\{-\ell_k^T u_m\} + \frac{N_m}{\xi_m}. \tag{23}
$$

We again denote the top-level stick-breaking weights by $p_k = V_k \prod_{j=1}^{k-1} (1 - V_j)$. The
expectations from this distribution that we use in subsequent updates are
$\mathbb{E}_Q[Z_k^{(m)}] = a_k^{(m)} / b_k^{(m)}$ and
$\mathbb{E}_Q[\ln Z_k^{(m)}] = \psi(a_k^{(m)}) - \ln b_k^{(m)}$, where $\psi(\cdot)$ is the digamma function.

The auxiliary parameter allows us to approximate the term
$\mathbb{E}_Q[\ln p(C_n^{(m)} = k \mid Z_{1:T}^{(m)})]$ appearing in the lower bound. To derive this, we use
a first order Taylor expansion on the following intractable expectation,

$$
-\mathbb{E}_Q\Big[ \ln \sum_{k=1}^{T} Z_k^{(m)} \Big] \;\geq\; -\ln \xi_m - \frac{\sum_{k=1}^{T} \mathbb{E}_Q[Z_k^{(m)}] - \xi_m}{\xi_m}.
$$

The update for the auxiliary variable $\xi_m$ is $\xi_m = \sum_{k=1}^{T} \mathbb{E}_Q[Z_k^{(m)}]$. See the
appendix for the complete derivation.
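
The following sketch performs one pass of the update in Equation (23) for a single document, followed by the update of the auxiliary variable. It is a minimal illustration: the function names are hypothetical, the digamma function is approximated with a standard recurrence plus asymptotic series (an implementation choice made here to avoid external libraries), and the current value of $\xi_m$ is passed in by the caller.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def update_z(beta, p, phi, locations, u, xi):
    """One coordinate update of q(Z_k) for a document, then the update of xi.

    phi[n][k] are the word-level topic responsibilities, locations[k] is l_k,
    u is the document location and xi is the current auxiliary parameter.
    """
    n_topics = len(p)
    n_words = len(phi)
    a = [beta * p[k] + sum(row[k] for row in phi) for k in range(n_topics)]
    b = [math.exp(-sum(l * x for l, x in zip(locations[k], u))) + n_words / xi
         for k in range(n_topics)]
    e_z = [a[k] / b[k] for k in range(n_topics)]
    e_log_z = [digamma(a[k]) - math.log(b[k]) for k in range(n_topics)]
    new_xi = sum(e_z)
    return a, b, e_z, e_log_z, new_xi


a, b, e_z, e_log_z, new_xi = update_z(
    beta=1.0, p=[0.6, 0.4], phi=[[1.0, 0.0], [0.0, 1.0]],
    locations=[[0.0], [0.0]], u=[0.0], xi=2.0)
```

In practice one would alternate this update with the updates of $\phi$ and $u_m$ until the document-level parameters stabilize.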

Coordinate update of $q(u_m)$. We update the location of the $m$th document using gradient
ascent, which takes the general form $u'_m = u_m + \rho \nabla_{u_m} \mathcal{L}$. We take several steps
in updating this value within an iteration. For step $s$ we update $u_m$ as

$$
u_m^{(s+1)} = (1 - \rho_s)\, u_m^{(s)} + \rho_s \sum_{k=1}^{T} \Big( \mathbb{E}_Q[Z_k^{(m)}]\, e^{-\ell_k^T u_m^{(s)}} - \beta p_k \Big) \ell_k. \tag{24}
$$

We let the step size $\rho$ be a function of step number $s$, and (for example) set it to
$\rho_s = \frac{1}{T}(3 + s)^{-1}$ for $s = 1, \ldots, 20$. We use $1/T$ to give a per-topic average,
which helps to stabilize the magnitude of the gradient by removing its dependence on truncation
level $T$, while $(3 + s)^{-1}$ shrinks the step size. For each iteration, we reset $s = 1$.
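
A minimal sketch of the inner gradient loop for a document location, following Equation (24) with the example step size $\rho_s = \frac{1}{T}(3+s)^{-1}$, is given below. The function name `update_doc_location` and the toy inputs are hypothetical.

```python
import math


def dot(x, y):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))


def update_doc_location(u, e_z, beta, p, locations, n_steps=20):
    """Run n_steps gradient steps on the document location u.

    e_z[k] = E_Q[Z_k] for this document, locations[k] = l_k.
    The step counter restarts at s = 1 on every call.
    """
    n_topics = len(p)
    for s in range(1, n_steps + 1):
        rho = (1.0 / n_topics) / (3.0 + s)
        drift = [0.0] * len(u)
        for k in range(n_topics):
            weight = e_z[k] * math.exp(-dot(locations[k], u)) - beta * p[k]
            for i, l_ki in enumerate(locations[k]):
                drift[i] += weight * l_ki
        u = [(1.0 - rho) * u_i + rho * g_i for u_i, g_i in zip(u, drift)]
    return u


# One topic, one dimension, one step: rho = 1/4 and the drift equals 1.
u_one_step = update_doc_location([0.0], [2.0], 1.0, [1.0], [[1.0]], n_steps=1)
# If E[Z_k] already equals beta * p_k at u = 0, the location does not move.
u_fixed = update_doc_location([0.0, 0.0], [0.6, 0.4], 1.0, [0.6, 0.4],
                              [[1.0, 0.0], [0.0, 1.0]])
```

The $(1 - \rho_s)$ factor comes from the standard normal prior on $u_m$, which pulls the location toward the origin.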

Corpus-level parameters 

After optimizing the variational parameters for each document, we turn to the corpus-level 
parameters. In the coordinate ascent algorithm, we update each corpus-level parameter once 
before returning to the document-level parameters. 

Coordinate update of $q(\eta_k)$. The variational distribution for the topic parameters is Dirichlet
with parameter vector $\gamma_k$. For each of $d = 1, \ldots, D$ vocabulary words

$$ \gamma_{k,d} = \gamma_0 + \sum_{m=1}^{M} \sum_{n=1}^{N_m} \phi_{n,k}^{(m)}\, \mathbb{I}\big(X_n^{(m)} = d\big), \tag{25} $$

where $\gamma_0$ is the parameter for the base distribution $\eta_k \sim \mathrm{Dirichlet}(\gamma_0)$.
Statistics needed for this term can be updated in unison with updates to $q(C_n^{(m)})$ for faster
inference.
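
Equation (25) is an accumulation of expected word counts per topic. A short sketch, with hypothetical names and a toy corpus of a single three-word document, is:

```python
def update_gamma(docs, phis, gamma0, n_topics, vocab_size):
    """Batch update of the topic Dirichlet parameters.

    docs[m] is a list of word ids, phis[m][n][k] is phi for word n of document m.
    Returns gamma[k][d] = gamma0 + expected count of word d assigned to topic k.
    """
    gamma = [[gamma0] * vocab_size for _ in range(n_topics)]
    for words, phi in zip(docs, phis):
        for d, phi_n in zip(words, phi):
            for k in range(n_topics):
                gamma[k][d] += phi_n[k]
    return gamma


gamma = update_gamma(
    docs=[[0, 1, 1]],
    phis=[[[1.0, 0.0], [0.5, 0.5], [0.5, 0.5]]],
    gamma0=0.1, n_topics=2, vocab_size=3)
```

Since the counts depend only on $\phi$, they can be accumulated while the document-level updates are running rather than in a separate pass.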

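Coordinate update of $q(V_k)$. For $k = 1, \ldots, T - 1$, the $q$ distribution for each $V_k$ is a delta
function, $\delta_{V_k}$. The truncation of the top-level DP results in $V_T := 1$. We use steepest ascent to
jointly optimize $V_1, \ldots, V_{T-1}$. The gradient of each element is

$$
\frac{\partial \mathcal{L}(\cdot)}{\partial V_k} = -\frac{\alpha - 1}{1 - V_k}
+ \beta \Bigg[ \frac{p_k}{V_k} \Big( \sum_{m=1}^{M} \big( \mathbb{E}_Q[\ln Z_k^{(m)}] - \ell_k^T u_m \big) - M \psi(\beta p_k) \Big)
- \sum_{j > k} \frac{p_j}{1 - V_k} \Big( \sum_{m=1}^{M} \big( \mathbb{E}_Q[\ln Z_j^{(m)}] - \ell_j^T u_m \big) - M \psi(\beta p_j) \Big) \Bigg]. \tag{26}
$$

We observed similar performance using Newton's method in our experiments.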



Coordinate update of $q(\ell_k)$. We update the location of the $k$th topic by gradient ascent,
which has the general form $\ell'_k = \ell_k + \rho \nabla_{\ell_k} \mathcal{L}$. We use the same updating
approach as discussed for $u_m$. For step $s$ within a given iteration, the update is

$$
\ell_k^{(s+1)} = (1 - \rho_s / c)\, \ell_k^{(s)} + \rho_s \sum_{m=1}^{M} \Big( \mathbb{E}_Q[Z_k^{(m)}]\, e^{-\ell_k^{(s)T} u_m} - \beta p_k \Big) u_m. \tag{27}
$$

As with $u_m$, we let the step size $\rho$ be a function of step number $s$, and set it to
$\rho_s = \frac{1}{M}(3 + s)^{-1}$.

Coordinate updates of $q(\alpha)$ and $q(\beta)$. We place a $\mathrm{Gamma}(\tau_1, \tau_2)$ prior on
$\alpha$ and model the posterior with a delta function. The update for this parameter is

$$ \alpha = \frac{T + \tau_1 - 2}{\tau_2 - \sum_{k=1}^{T-1} \ln(1 - V_k)}. \tag{28} $$

In our empirical study we set $\tau_1 = 1$ and $\tau_2 = 10^{-3}$.
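
As a small worked example of the closed-form update for $\alpha$, the sketch below assumes the update has the form $(T + \tau_1 - 2) / (\tau_2 - \sum_{k=1}^{T-1} \ln(1 - V_k))$, where $T$ is the truncation level. The function name and the example values are hypothetical.

```python
import math


def update_alpha(v, tau1, tau2):
    """Point estimate of alpha given stick proportions v = [V_1, ..., V_{T-1}].

    Assumes a Gamma(tau1, tau2) prior on alpha and V_k ~ Beta(1, alpha).
    """
    truncation = len(v) + 1
    log_remaining = sum(math.log(1.0 - v_k) for v_k in v)
    return (truncation + tau1 - 2.0) / (tau2 - log_remaining)


alpha_hat = update_alpha([0.5, 0.5], tau1=1.0, tau2=1.0)
```

Larger stick proportions make $-\sum_k \ln(1 - V_k)$ larger and therefore shrink the estimate of $\alpha$, which matches the intuition that mass concentrated on the first few topics implies a small concentration parameter.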

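We also place a $\mathrm{Gamma}(\kappa_1, \kappa_2)$ prior on the second-level concentration parameter
$\beta$ and optimize using gradient ascent. The first derivative is

$$
\frac{\partial \mathcal{L}(\cdot)}{\partial \beta} = \sum_{k=1}^{T} p_k \Big( \sum_{m=1}^{M} \big( \mathbb{E}_Q[\ln Z_k^{(m)}] - \ell_k^T u_m \big) - M \psi(\beta p_k) \Big) + \frac{\kappa_1 - 1}{\beta} - \kappa_2. \tag{29}
$$

We set $\kappa_1 = 1$ and $\kappa_2 = 10^{-3}$.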



4.2. Stochastic variational inference

The algorithm of Section 4.1 can be called a batch algorithm because it updates all document-level
parameters in one "batch" before updating the global parameters. A potential drawback of
this batch inference approach for DILN (as well as potential Monte Carlo sampling algorithms)
is that the per-iteration running time increases with an increasing number of groups. For many
modeling applications, the algorithm may be impractical for large-scale problems.

One solution to the large-scale data problem is to sub-sample a manageable number of groups 
from the larger collection, and assume that this provides a good statistical representation of 
the entire data set. Indeed, this is the hope with batch inference, which views the data set as a 
representative sample from the larger, unseen population. However, in this scenario information 
contained in the available data set may be lost. Stochastic variational inference methods
(Hoffman, Blei and Bach, 2010; Sato, 2001; Wang, Paisley and Blei, 2011) aim for the best of both
worlds, allowing one to fit global parameters for massive collections of data in less time than it 
takes to solve problems of moderate size in the batch setting. 

The idea behind stochastic variational inference is to perform stochastic optimization of the 
variational objective function in Equation (21). In topic modeling, we can construe this objective 
function as a sum over per-document terms and then obtain noisy estimates of the gradients by 
evaluating them on sets of documents sampled from the full corpus. By following these noisy 
estimates of the gradient with a decreasing step size, we are guaranteed convergence to a local 
optimum of the variational objective function (Hoffman, Blei and Bach, 2010; Robbins and 
Monro, 1951; Sato, 2001). 

Algorithmically, this gives an advantage over the optimization algorithm of Section 4.1 for 
large-scale machine learning. The bottleneck of that algorithm is the variational "E step," where 
the document-level variational parameters are optimized for all documents using the current 
settings of the corpus-level variational parameters (i.e., the topics and their locations, and
$\alpha$, $\beta$). This computation may be wasteful, especially in the first several iterations, where the initial
topics likely do not represent the corpus well. In contrast, the structure of a stochastic variational 
inference algorithm is to repeatedly subsample documents, analyze them, and then use them to 
update the corpus-level variational parameters. When the data set is massive, these corpus-level 
parameters can converge before seeing any document a second time. 

In more detail, let $X$ be a very large collection of $M$ documents. We separate the hidden
variables into those for the top-level
$\Theta' = \{\eta_{1:T}, V_{1:T-1}, \ell_{1:T}, \alpha, \beta\}$ and the document-level
$\Theta_m = \{C_{1:N_m}^{(m)}, u_m, Z_{1:T}^{(m)}\}$ for $m = 1, \ldots, M$. These variables have variational
parameters $\Psi' = \{\gamma_{1:T,1:D}, \hat{\ell}_{1:T}, \hat{V}_{1:T-1}, \hat{\alpha}, \hat{\beta}\}$ and
$\Psi_m = \{\phi_{1:N_m}^{(m)}, a_{1:T}^{(m)}, b_{1:T}^{(m)}, \hat{u}_m\}$ for their respective $Q$
distributions. Because of the independence assumption between documents, the variational objective
decomposes into a sum over documents,

$$
\mathcal{L}(X, \Psi) = \sum_{m=1}^{M} \mathbb{E}_Q\big[\ln p(X_m, \Theta_m \mid \Theta')\big] + \sum_{m=1}^{M} \mathbb{H}[Q(\Theta_m)] + \mathbb{E}_Q[\ln p(\Theta')] + \mathbb{H}[Q(\Theta')]. \tag{30}
$$

Algorithm 2 Stochastic variational Bayes for DILN
Stochastically optimize the variational lower bound $\mathcal{L}$
Primary goal: Optimize corpus-wide variational parameters $\Psi'$
Secondary goal: Optimize document-specific parameters $\Psi_m$ for $m = 1, \ldots, M$
1: while $\Psi'$ has not converged do
2:    Select random subset $B_t \subset \{1, \ldots, M\}$
3:    for $m \in B_t$ do
4:       Optimize $\Psi_m$ (Equations 22-24)
5:    end for
6:    Set gradient step size $\rho_t = (\zeta + t)^{-\kappa}$, $\kappa \in (\frac{1}{2}, 1]$
7:    Update $\Psi'$ using gradient of $\mathcal{L}^{(t)}$ constructed from documents $m \in B_t$ (Equations 26, 27, 29, 34, 36-38)
8: end while
9: Optimize $\Psi_m$ for $m = 1, \ldots, M$ using optimized $\Psi'$



As we discussed, in batch inference we optimize variational distributions on
$\Theta_1, \ldots, \Theta_M$ before updating those on $\Theta'$. Now, consider an alternate objective
function at iteration $t$ of inference,

$$
\mathcal{L}^{(t)}(X_{m_t}, \Psi_{m_t}, \Psi') = M\, \mathbb{E}_Q\big[\ln p(X_{m_t}, \Theta_{m_t} \mid \Theta')\big] + M\, \mathbb{H}[Q(\Theta_{m_t})] + \mathbb{E}_Q[\ln p(\Theta')] + \mathbb{H}[Q(\Theta')], \tag{31}
$$

where $m_t$ is selected uniformly at random from $\{1, \ldots, M\}$. An approach to optimize this
objective function would be to first optimize the variational parameters of $Q(\Theta_{m_t})$, followed by
a single gradient step for those of $Q(\Theta')$. In determining the relationship between Equation
(31) and Equation (30), note that under the uniform distribution $p(m_t)$ on which document is
selected,

$$ \mathbb{E}_{p(m_t)}\big[\mathcal{L}^{(t)}(X_{m_t}, \Psi_{m_t}, \Psi')\big] = \mathcal{L}(X, \Psi). \tag{32} $$

We are thus stochastically optimizing $\mathcal{L}$. In practice, one document is not enough to ensure fast
convergence of $Q(\Theta')$. Rather, we select a subset $B_t \subset \{1, \ldots, M\}$ at iteration $t$ and optimize

$$
\mathcal{L}^{(t)}(X_{B_t}, \Psi_{B_t}, \Psi') = \frac{M}{|B_t|} \sum_{i \in B_t} \mathbb{E}_Q\big[\ln p(X_i, \Theta_i \mid \Theta')\big] + \frac{M}{|B_t|} \sum_{i \in B_t} \mathbb{H}[Q(\Theta_i)] + \mathbb{E}_Q[\ln p(\Theta')] + \mathbb{H}[Q(\Theta')], \tag{33}
$$

over the variational parameters of $Q(\Theta_i)$ for $i \in B_t$. We again follow this with a step for the
variational parameters of $Q(\Theta')$, but this time using the information from documents indexed by
$B_t$. That is, for some corpus-level parameter $\psi \in \Psi'$, the update of $\psi$ at iteration $t + 1$
given $\psi$ at iteration $t$ is

$$ \psi^{(t+1)} = \psi^{(t)} + \rho_t A_\psi \nabla_\psi \mathcal{L}^{(t)}(X_{B_t}, \Psi_{B_t}, \Psi'), \tag{34} $$

where $A_\psi$ is a positive definite preconditioning matrix and $\rho_t > 0$ is a step size satisfying

$$ \sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty. \tag{35} $$

In our experiments, we select the form $\rho_t = (\zeta + t)^{-\kappa}$ with $\kappa \in (0.5, 1]$ and $\zeta > 0$.
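
The role of the step-size schedule can be illustrated on a toy problem that is unrelated to DILN itself: estimating the mean of a stream of noisy observations by following noisy gradients. In the sketch below the data, the seed and the function names are assumptions made for the example; only the form $\rho_t = (\zeta + t)^{-\kappa}$ is taken from the text.

```python
import random


def step_size(t, zeta, kappa):
    """Step size rho_t = (zeta + t)^(-kappa), with zeta > 0 and kappa in (0.5, 1]."""
    if zeta <= 0.0 or not (0.5 < kappa <= 1.0):
        raise ValueError("need zeta > 0 and kappa in (0.5, 1]")
    return (zeta + t) ** (-kappa)


def robbins_monro_mean(samples, zeta=25.0, kappa=0.75, start=0.0):
    """Follow noisy gradients (x - theta) of -0.5 * E[(theta - x)^2] with decreasing steps."""
    theta = start
    for t, x in enumerate(samples, start=1):
        rho = step_size(t, zeta, kappa)
        theta = theta + rho * (x - theta)
    return theta


rng = random.Random(1)
data = [rng.gauss(3.0, 1.0) for _ in range(5000)]
estimate = robbins_monro_mean(data)
```

Each step is a convex combination of the current estimate and a noisy target, which is the same structure that the preconditioned corpus-level updates take in the stochastic algorithm.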

In some cases, the preconditioner $A_\psi$ can be set to give simple and clear updates. For example,
in the case of topic modeling, Hoffman, Blei and Bach (2010) show how the inverse Fisher 
information leads to very intuitive updates (see the next section). This is a special case of the 
theory outlined in Sato (2001) that arises in conjugate exponential family models. However, the 
Fisher information is not required for stochastic variational inference; we can precondition with 
the inverse negative Hessian or decide not to precondition. 



Paisley, Wang and Blei/The Discrete Infinite Logistic Normal Distribution 






4.2.1. The stochastic variational inference algorithm for DILN

The stochastic algorithm selects a subset of documents at step $t$, coded by a set of index values
$B_t$, and optimizes the document-level parameters for these documents while holding all
corpus-level parameters fixed. These parameters are the word indicators $C_n^{(m)}$, the unnormalized
topic weights $Z_k^{(m)}$, and the document locations $u_m$. (See Section 4.1 for discussion on inference
for these variables.) Given the values of the document-level variational parameters for documents
indexed by $B_t$, we now describe the corpus-level updates in the stochastic inference algorithm.
Algorithm 2 summarizes this general inference structure.

Stochastic update of $q(\eta_k)$. This update follows from Hoffman, Blei and Bach (2010) and
Wang, Paisley and Blei (2011). We set $A_{\gamma_k}$ to be the inverse Fisher information of $q(\eta_k)$,

$$ A_{\gamma_k} = \left( -\mathbb{E}_Q\left[ \frac{\partial^2 \ln q(\eta_k)}{\partial \gamma_k\, \partial \gamma_k^T} \right] \right)^{-1}. $$

With this quantity, we take the product
$A_{\gamma_k} \nabla_{\gamma_k} \mathcal{L}^{(t)}(X_{B_t}, \Psi_{B_t}, \Psi')$. This gives the following
update for each $\gamma_{k,d}$,

$$
\gamma_{k,d}^{(t+1)} = (1 - \rho_t)\, \gamma_{k,d}^{(t)} + \rho_t \Big( \gamma_0 + \frac{M}{|B_t|} \sum_{m \in B_t} \sum_{n=1}^{N_m} \phi_{n,k}^{(m)}\, \mathbb{I}\big(X_n^{(m)} = d\big) \Big). \tag{36}
$$

In this case, premultiplying the gradient by the inverse Fisher information cancels the Fisher
information in the gradient and thus removes the cross-dependencies between the components of
$\gamma_k$. We use preconditioning to simplify the computation, rather than to speed up optimization.
See Hoffman, Blei and Bach (2010), Wang, Paisley and Blei (2011) and Sato (2001) for details.
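
The stochastic update of the topic Dirichlet parameters is a convex combination of the current value and a batch-based estimate in which the batch counts are rescaled by $M / |B_t|$. The sketch below shows this for a single topic; the function name, the argument layout and the toy numbers are hypothetical.

```python
def stochastic_gamma_update(gamma_k, batch_docs, batch_phi_k, gamma0, n_docs_total, rho):
    """One stochastic step for the Dirichlet parameter vector of a single topic k.

    batch_docs[m] lists word ids of a sampled document, batch_phi_k[m][n] is the
    responsibility of topic k for word n of that document, n_docs_total is M.
    """
    counts = [0.0] * len(gamma_k)
    for words, phi_k in zip(batch_docs, batch_phi_k):
        for d, weight in zip(words, phi_k):
            counts[d] += weight
    scale = n_docs_total / len(batch_docs)
    return [(1.0 - rho) * g + rho * (gamma0 + scale * c)
            for g, c in zip(gamma_k, counts)]


new_gamma = stochastic_gamma_update(
    gamma_k=[1.0, 1.0], batch_docs=[[0, 0]], batch_phi_k=[[0.5, 0.5]],
    gamma0=0.0, n_docs_total=10, rho=0.5)
```

With $\rho_t = 1$ and $B_t = \{1, \ldots, M\}$ this reduces to the batch update of the topic parameters.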

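Stochastic update of $q(V_k)$ and $q(\ell_k)$. The stochastic updates of the delta $q$ distributions do
not use the Fisher information. Rather, we update the vectors $V = [V_1, \ldots, V_{T-1}]^T$ and $\ell_k$ for
$k = 1, \ldots, T$ by taking steps in their Newton directions using the data in batch $B_t$ to determine
this direction. The gradients $\nabla \mathcal{L}$ for these parameters are given in the batch algorithm and
their form is unchanged here. The key difference is that the gradient of these parameters at step $t$ is
only calculated over documents with index values in $B_t$. We use the inverse negative Hessian as
a preconditioning matrix for $\ell_k$ and $(V_1, \ldots, V_{T-1})$. For $\ell_k$, the preconditioning matrix is

$$ A_{\ell_k}^{-1} = c^{-1} I_d + \sum_{m=1}^{M} \mathbb{E}_Q[Z_k^{(m)}]\, e^{-\ell_k^T u_m}\, u_m u_m^T. \tag{37} $$

For $(V_1, \ldots, V_{T-1})$ the values of $(A_V^{-1})_{kk}$ and $(A_V^{-1})_{kr}$ are found from the second
derivatives (with the second derivatives written for $r < k$)

$$
\frac{\partial^2 \mathcal{L}(\cdot)}{\partial V_k^2} = -\frac{\alpha - 1}{(1 - V_k)^2}
- M \beta^2 \Bigg[ \Big( \frac{p_k}{V_k} \Big)^2 \psi'(\beta p_k) + \sum_{j > k} \Big( \frac{p_j}{1 - V_k} \Big)^2 \psi'(\beta p_j) \Bigg], \tag{38}
$$

$$
\begin{aligned}
\frac{\partial^2 \mathcal{L}(\cdot)}{\partial V_k\, \partial V_r} ={}& M \beta^2 \Bigg[ \frac{p_k^2}{V_k (1 - V_r)} \psi'(\beta p_k) - \sum_{j > k} \frac{p_j^2}{(1 - V_k)(1 - V_r)} \psi'(\beta p_j) \Bigg] \\
& - \beta \Bigg[ \frac{p_k}{V_k (1 - V_r)} \Big( \sum_{m=1}^{M} \big( \mathbb{E}_Q[\ln Z_k^{(m)}] - \ell_k^T u_m \big) - M \psi(\beta p_k) \Big) \\
& \qquad - \sum_{j > k} \frac{p_j}{(1 - V_k)(1 - V_r)} \Big( \sum_{m=1}^{M} \big( \mathbb{E}_Q[\ln Z_j^{(m)}] - \ell_j^T u_m \big) - M \psi(\beta p_j) \Big) \Bigg],
\end{aligned}
\tag{39}
$$

where $\psi'(\cdot)$ is the trigamma function. In the stochastic setting, each sum over documents
$\sum_{m=1}^{M}$ in these expressions is replaced by the rescaled batch sum
$(M / |B_t|) \sum_{m \in B_t}$, consistent with Equation (33).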

Online update of $q(\alpha)$ and $q(\beta)$. The stochastic updates for $\beta$ move in the direction of
steepest ascent, calculated using the documents in the batch. Since this is a one-dimensional
parameter, we optimize a batch-specific value for this parameter at step $t$, $\beta'_t$, and set
$\beta_{t+1} = (1 - \rho_t) \beta_t + \rho_t \beta'_t$. The update for $\alpha$ does not consider
document-level parameters, and so this value follows the update given in Equation (28).

4.3. A new variational inference algorithm for the HDP

The variational inference algorithm above relates closely to one that can be derived for the HDP
using the normalized gamma process representation of Section 2.2. The difference lies in the
update for the topic weight $q(Z_k^{(m)})$ in Equation (23). In both algorithms, the update for its
variational parameter $a_k^{(m)}$ contains the prior from the top-level DP, and the expected number
of words in document $m$ drawn from topic $k$. The variational parameter $b_k^{(m)}$ distinguishes DILN
from the HDP.

We can obtain a variational inference algorithm for the HDP by setting the first term in
the update for $b_k^{(m)}$ equal to one. In contrast, the first term for DILN is
$\exp\{-\ell_k^T u_m\}$, which is the Gaussian process that generates the covariance between component
probability weights. Including or excluding this term switches between variational inference for
DILN and variational inference for the HDP. See the appendix for a fuller derivation.

4.4. MCMC inference 

Markov chain Monte Carlo (MCMC, Robert and Casella, 2004) sampling is a more common
strategy for approximate posterior inference in Bayesian nonparametric models, and for the
hierarchical Dirichlet process in particular. In MCMC methods, samples are drawn from a carefully
designed Markov chain, whose stationary distribution is the target posterior of the model
parameters. MCMC is convenient for the many Bayesian nonparametric models that are amenable
to Gibbs sampling, where the Markov chain iteratively samples from the conditional distribution
of each latent variable given all of the other latent variables and the observations.

However, Gibbs sampling is not an option for DILN because the Gaussian process component 
does not have a closed-form full conditional distribution. One possible sampling algorithm for 
DILN inference would use Metropolis-Hastings (Hastings, 1970), where samples are drawn from 
a proposal distribution and then accepted or rejected. Designing a good proposal distribution 
is the main problem in designing Metropolis-Hastings algorithms, and in DILN this problem is 
more difficult than usual because the hidden variables are highly correlated. 

Recently, slice sampling has been applied to sampling of infinite mixture models by turning
the problem into a finite sampling problem (Griffin and Walker, 2010; Kalli, Griffin and Walker,
2011). These methods apply when the mixture weights are either from a simple stick-breaking
prior or a normalized random measure that can be simulated from a Poisson process. Neither
of these settings applies to DILN because the second-level DP is a product of a DP and an
exponentiated GP. Furthermore, it is not clear how to extend slice sampling methods to hierarchical
models like the HDP or DILN.

Variational methods mitigate all these issues by using optimization to approximate the
posterior. Our algorithm sacrifices the theoretical (and eventual) convergence to the full posterior
in favor of a simpler distribution that is fit to minimize its KL-divergence to the posterior.
Though we must address issues of local minima in the objective, we do not need to develop
complicated proposal distributions or solve the difficult problem of assessing convergence of a
high-dimensional Markov chain to its stationary distribution.³ Furthermore, variational
inference is ideally suited to the stochastic optimization setting, allowing for approximate inference
with very large data sets.

³ Note our evaluation method of Section 5 does not use the divergence of the variational approximation and
the true posterior. Rather, we measure the corresponding approximation to the predictive distribution. On a pilot
study of batch inference, we found that MCMC inference (with its approximate predictive distribution) did not
produce distinguishable results from variational inference.

Table 1
Data sets. Five training/testing sets were constructed by selecting the number of documents shown for each
corpus from larger data sets.

Corpus             # training   # testing   vocabulary size   # total words
Huffington Post    3,000        1,000       6,313             660,000
New York Times     5,000        2,000       3,012             720,000
Science            5,000        2,000       4,403             1,380,000
Wikipedia          5,000        2,000       6,131             1,770,000

5. Empirical study

We evaluate the DILN topic model with both batch and stochastic inference. For batch
inference, we compare with the HDP and correlated topic model (CTM) on four text corpora: The
Huffington Post, The New York Times, Science and Wikipedia. We divide each corpus into five
training and testing groups selected from a larger set of documents (see Table 1).

For stochastic inference, we use the Nature corpus to assess performance. This corpus contains
352,549 documents spanning 1869-2003; we used a vocabulary of 4,253 words. We compare
stochastic DILN with a stochastic HDP algorithm and with online LDA (Hoffman, Blei and
Bach, 2010).

5.1. Evaluation metric

Before discussing the experimental setup and results, we discuss our method for evaluating
performance. We evaluate the approximate posterior of all models by measuring its predictive
ability on held-out documents. Following Asuncion et al. (2009), we randomly partition each
test document into two halves and evaluate the conditional distribution of the second half given
the first half and the training data. Operationally, we use the first half of each document to find
estimates of document-specific topic proportions and then evaluate how well these combine with
the fitted topics to predict the second half of the document.

More formally, denote the training data by $\mathcal{D}$, a test document as $X$, which is divided into
halves $X'$ and $X''$. We want to calculate the conditional marginal probability,

$$
p(X'' \mid X', \mathcal{D}) = \int \prod_{n=1}^{N} \Big[ \sum_{k=1}^{T} \eta_k(X''_n)\, p(C''_n = k \mid Z_{1:T}) \Big]\, dQ(Z)\, dQ(\eta), \tag{40}
$$

where $N$ is the number of observations constituting $X''$, $C''_n$ is the latent indicator associated
with the $n$th word in $X''$, and $\eta := \eta_{1:T}$ and $Z := Z_{1:T}$.

Since the integral in Equation (40) is intractable, we sample i.i.d. values from the factorized
distributions $Q(Z_{1:T})$ and $Q(\eta_{1:T})$ for approximation. We note that the information regarding
the document's correlation structure can be found in $Q(Z_{1:T})$.

We then use this approximation of the marginal likelihood to compute the average per-word
perplexity for the second half of the test document,

$$ \text{perplexity} = \exp\left\{ -\frac{\ln p(X'' \mid X')}{N} \right\}, \tag{41} $$

Fig 3. Perplexity results for four text corpora and averaged over five training/testing sets. For a fixed Dirichlet
hyperparameter, the DILN topic model typically achieves better perplexity than both the HDP and CTM models. 
In all corpora, DILN achieves the best perplexity overall. 



with lower perplexity indicating better performance. Note that the term $\ln p(X'' \mid X')$ involves a
sum over the $N$ words in $X''$. Also note that this is an objective measure of the predictive
performance of the predictive probability distribution computed from the variational approximation.
It is a good measure of performance (of the model and the variational inference algorithm)
because it does not rely on the closeness of the variational distribution to the true posterior, as
measured by the variational lower bound. That closeness, much like whether a Markov chain
has converged to its stationary distribution, is difficult to assess.
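
The Monte Carlo approximation of the held-out probability and the resulting per-word perplexity can be sketched as follows. This is an illustrative implementation under stated assumptions: samples of $Z_{1:T}$ and $\eta_{1:T}$ are supplied by the caller (how they are drawn from $Q$ is not shown), the indicator probability is taken to be $p(C''_n = k \mid Z_{1:T}) = Z_k / \sum_j Z_j$ from the normalized gamma representation, and the function name is hypothetical.

```python
import math


def heldout_perplexity(words, z_samples, eta_samples):
    """Per-word perplexity of the held-out half of one document.

    words: word ids of the second half; z_samples[s][k]: sampled weights;
    eta_samples[s][k][d]: sampled topic-word probabilities.
    """
    log_terms = []
    for z, eta in zip(z_samples, eta_samples):
        total_z = sum(z)
        log_p = 0.0
        for d in words:
            word_prob = sum(eta[k][d] * z[k] / total_z for k in range(len(z)))
            log_p += math.log(word_prob)
        log_terms.append(log_p)
    # Average the per-sample likelihoods in log space (log-mean-exp).
    top = max(log_terms)
    log_mean = top + math.log(sum(math.exp(v - top) for v in log_terms) / len(log_terms))
    return math.exp(-log_mean / len(words))


# Sanity check: topics that are uniform over a 4-word vocabulary give perplexity 4.
uniform_topic = [0.25, 0.25, 0.25, 0.25]
perp = heldout_perplexity(
    words=[0, 1, 2],
    z_samples=[[1.0, 3.0], [2.0, 2.0]],
    eta_samples=[[uniform_topic, uniform_topic], [uniform_topic, uniform_topic]])
```

A model that predicts no better than a uniform distribution over the vocabulary has perplexity equal to the vocabulary size, which gives a natural reference point for the values reported below.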

5.2. Experimental setup and results 

Batch variational inference experiments. We trained all models using variational
inference; for the CTM, this is the algorithm given in Blei and Lafferty (2007); for the HDP, we use
the inference method from Section 4. For DILN, we use a latent space with $d = 20$ and set the
location variance parameter $c = 1/20$. For DILN and the HDP, we truncate the top-level
stick-breaking construction at $T = 200$ components. For the CTM, we consider $K \in \{20, 50, 150\}$
topics. In our experiments, both DILN and HDP used significantly fewer topics than the
truncation level, indicating that the truncation level was set high enough. The CTM is not sparse in
this sense.

We initialize all models in the same way; to initialize the variational parameters of the topic
Dirichlet, we first cluster the empirical word distributions of each document with three iterations
of k-means using the $L_1$ distance measure. We then reorder these topics by their usage according
to the indicators produced by k-means. We scale these k-means centroids and add a small
constant plus noise to smooth the initialization. The other parameters are initialized to values
that favor a uniform distribution on these topics. Variational inference is terminated when the
fractional change in the lower bound of Equation (21) falls below $10^{-3}$. We run each algorithm
using five different topic Dirichlet hyperparameter settings: $\gamma_0 \in \{0.1, 0.25, 0.5, 0.75, 1.0\}$.

Figure 3 contains testing results for the four corpora. In general, DILN outperforms both the
HDP and CTM. Given that the inference algorithms for DILN and the HDP are only different in
the one term discussed in Section 4.3, this demonstrates that the latent location space models a
correlation structure that helps in predicting words. Computation time for DILN and the HDP
was comparable, both requiring on the order of one minute per iteration. Depending on the
truncation level, the CTM was slightly to significantly faster than both DILN and the HDP.

We display the learned correlation structure for the four corpora in Figures 4-6. (See Figure
1 for results on a slightly larger Wikipedia corpus.) In these figures, we represent the 30 most
probable topics by their ten most probable words. Above these lists, we show the positive and
negative correlations learned using the latent locations $\ell_k$. For two topics $i$ and $j$ this value is
$\ell_i^T \ell_j / (\|\ell_i\|_2 \|\ell_j\|_2)$. From these figures, we see that DILN learns meaningful
underlying correlations in topic expression within a document.

As we discussed in Section 3.4, the underlying vectors $u_m \in \mathbb{R}^d$ associated with each document 
can be used for retrieval applications. In Figure 7, we show recommendation lists for a 
16,000-document corpus of the journal Science obtained using these underlying document locations. 
We use the cosine similarity between two documents for ranking, which for documents $i$ and $j$ 
is equal to $u_i^\top u_j / (\|u_i\|_2 \|u_j\|_2)$. We show several lists of recommended articles based on randomly 
selected query articles. These lists show that, as with the underlying correlations learned between 
the topics, DILN learns a meaningful relationship between the documents as well, which is useful 
for navigating text corpora. 
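
The ranking step can be sketched as follows. This is an illustration only: the function `rank_by_cosine` and the toy array `U`, whose rows stand in for the document locations $u_m$, are hypothetical and are not the authors' code or data.

```python
import numpy as np


def rank_by_cosine(U, query, top_n=5):
    """Rank documents by cosine similarity of their locations to a query document.

    U is an (M, d) array whose rows play the role of the document locations u_m.
    """
    norms = np.linalg.norm(U, axis=1)
    sims = (U @ U[query]) / (norms * norms[query])
    # Sort from most to least similar and drop the query itself.
    order = [int(i) for i in np.argsort(-sims) if i != query]
    return [(i, float(sims[i])) for i in order[:top_n]]


# Toy locations (hypothetical): documents 0 and 1 point in similar directions.
U = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
ranking = rank_by_cosine(U, query=0, top_n=3)
```

Because the similarity depends only on the direction of each location vector, documents with very different vector norms can still be ranked as close neighbors.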

Stochastic variational inference. We compare stochastic DILN with stochastic HDP and 
online LDA using 352,549 documents from Nature. As for batch inference, we can obtain a 
stochastic inference algorithm for the HDP as a special case of stochastic DILN. In DILN, 
we again use a latent space of $d = 20$ dimensions for the component locations and set the 
location variance parameter to $c = 1/20$. We truncate the models at 200 topics, and we evaluate 
performance for $K \in \{25, 75, 125\}$ topics with stochastic inference for LDA (Hoffman, Blei and 
Bach, 2010). As we discussed in Section 4.2, we use a step sequence of $\rho_t = (C + t)^{-\kappa}$. We set 
$C = 25$, and run the algorithm for $\kappa \in \{0.6, 0.75, 0.9\}$. We explored various batch sizes, running 
the algorithm for $|B_t| \in \{250, 750, 1250\}$. Following Hoffman, Blei and Bach (2010), we set the 
topic Dirichlet hyperparameters to $\gamma_0 = 0.01$. 
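
To make the effect of the decay concrete, the step sequence can be tabulated as below. This is a sketch under assumptions: the function name `step_size`, the argument name `offset` (standing in for the constant set to 25), and the chosen batch indices are illustrative only.

```python
def step_size(t, offset=25.0, kappa=0.75):
    """Step size (offset + t) ** (-kappa) for batch index t."""
    return (offset + t) ** (-kappa)


# Compare how quickly the step shrinks for the three decay values in the text.
schedules = {
    kappa: [step_size(t, kappa=kappa) for t in (1, 100, 1000)]
    for kappa in (0.6, 0.75, 0.9)
}
```

A smaller decay exponent keeps the step larger at late batches, so later documents retain more influence on the corpus-level parameters.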

For testing, we held out 10,000 randomly selected documents from the corpus. We measure the 
performance of the stochastic models after every 10th batch. Within each batch, we run several 
iterations of local variational inference to find document-specific parameters. We update corpus- 
level parameters when the change in the average per-document topic distributions falls below 
a threshold. On average, roughly ten document-level iterations were run for each corpus-level 
update. 

Figure 8 illustrates the results. In this figure, we show the per-word held-out perplexity as 
a function of the number of documents seen by the algorithm. From these plots we see that a 
slower decay in the step size improves performance. Especially for DILN, we see that performance 
improves significantly as the decay $\kappa$ decreases, since more information is being used from later 
documents in finding a maximum of the variational objective function. Slower decays are helpful 
because more parameters are being fitted by DILN than by the HDP and LDA. We observed 
that as $\kappa$ increases a less detailed correlation structure was found; this accounts for the decrease 
in performance. 
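
For reference, per-word perplexity is commonly computed as the exponential of the negative average held-out log likelihood per word. The sketch below assumes that standard definition; the function name and the toy inputs are hypothetical, and the exact held-out protocol of Section 5.1 is not reproduced here.

```python
import math


def per_word_perplexity(doc_log_likelihoods, doc_lengths):
    """exp of the negative average held-out log likelihood per word."""
    total_ll = sum(doc_log_likelihoods)
    total_words = sum(doc_lengths)
    return math.exp(-total_ll / total_words)


# Hypothetical held-out documents: each word has predictive probability 1/1000.
lengths = [120, 80]
log_liks = [n * math.log(1.0 / 1000.0) for n in lengths]
perplexity = per_word_perplexity(log_liks, lengths)
```

Lower values are better: a model that spreads its predictive mass uniformly over 1000 words has a perplexity of 1000.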

In Figure 11 we show the model after one pass through the Nature corpus. The upper left 
figure shows the locations of the top 50 topics projected from $\mathbb{R}^{20}$. These locations are rough 
approximations since the singular values were large for higher dimensions. The upper right figure 
shows the correlations between the topics. Below these two plots, we show the ten most probable 
words from the 50 most probable topics. In Figure 9 we show $\alpha$ and $\beta$ as a function of the number 
of documents seen by the model. In Figure 10 we show the correlations between 100 pairs of 

Topic 1: campaign, democratic, candidate, republican, election, voter, political, presidential, vote, party 
Topic 2: game, victory, second, score, third, win, team, play, season, lose 

Topic 3: president, executive, chief, vice, name, director, advertising, chairman, senior, company 

Topic 4: team, player, season, coach, game, play, football, league, contract, sign 

Topic 5: add, heat, pound, cup, oil, minute, water, large, dry, serve 

Topic 6: building, build, house, space, site, project, construction, area, foot, plan 

Topic 7: drug, patient, treatment, study, disease, risk, health, treat, cancer, cause 

Topic 8: economy, economic, percent, growth, increase, government, states, economist, price, rate 

Topic 9: police, officer, arrest, man, charge, yesterday, official, crime, drug, release 

Topic 10: share, company, stock, buy, percent, investment, acquire, sell, investor, firm 

Topic 11: budget, tax, cut, increase, taxis, state, plan, propose, reduce, pay 

Topic 12: shot, point, play, game, hit, ball, night, shoot, player, put 

Topic 13: computer, internet, information, site, technology, system, software, online, user, program 
Topic 14: art, artist, museum, exhibition, painting, collection, gallery, design, display, sculpture 
Topic 15: government, political, country, international, leader, soviet, minister, states, foreign, state 
Topic 16: book, story, write, novel, author, life, woman, writer, storey, character 
Topic 17: attack, kill, soldier, bomb, bombing, area, official, report, group, southern 
Topic 18: song, sing, band, pop, rock, audience, singer, voice, record, album 
Topic 19: market, stock, price, fall, trading, dollar, investor, trade, rise, index 
Topic 20: trial, lawyer, charge, prosecutor, case, jury, guilty, prison, sentence, judge 
Topic 21: play, movie, film, star, actor, character, theater, role, cast, production 

Topic 22: dance, stage, perform, dancer, company, production, present, costume, theater, performance 
Topic 23: peace, israeli, Palestinian, talk, Palestinians, territory, arab, leader, visit, settlement 
Topic 24: guy, thing, lot, play, feel, kind, game, really, little, catch 

Topic 25: science, theory, scientific, research, human, suggest, evidence, fact, point, question 
Topic 26: court, law, state, legal, judge, rule, case, decision, appeal, lawyer 

Topic 27: image, photograph, picture, view, photographer, subject, figure, paint, portrait, scene 
Topic 28: report, official, member, commission, committee, staff, agency, panel, investigate, release 
Topic 29: wine, restaurant, food, menu, price, dish, serve, meal, chicken, dining 
Topic 30: graduate, marry, father, degree, receive, ceremony, wedding, daughter, son, president 



Fig 4. New York Times: The ten most probable words from the 30 most popular topics. At top are the positive and 
negative correlation coefficients for these topics calculated by taking the dot product of the topic locations, $\ell_k^\top \ell_{k'}$ 
(separated for clarity). 



topics chosen at random; these are also shown as a function of the number of documents seen. 
In general, these plots indicate that the parameters are far along in the process of converging to 
a local optimum after just one pass through the entire corpus. Also shown in Figure 10 is the 
empirical word count per topic (that is, the values $\sum_{m,n} \mathbb{I}(C_n^{(m)} = k)$ as a function of $k$) after 
the final iteration of the first pass through the data. We see that the model learns approximately 

Topic 1: get, really, like, just, know, hes, think, dont, thing, say 

Topic 2: percent, year, said, last, prices, economy, quarter, home, economic, housing 
Topic 3: day, mother, life, family, father, mothers, love, time, home, fathers 
Topic 4: make, like, dont, youre, people, time, get, see, love, just 

Topic 5: delegates, obama, superdelegates, democratic, party, states, convention, primaries, michigan 
Topic 6: mccain, john, mccains, republican, campaign, bush, hes, just, senator, said 
Topic 7: show, song, said, music, night, first, david, like, simon, performance 

Topic 8: clinton, obama, clintons, hillary, nomination, democratic, barack, race, obamas, supporters 

Topic 9: hillary, obama, president, candidate, shes, win, time, democratic, hillarys, running 

Topic 10: iran, nuclear, weapons, states, said, united, attack, bush, president, iranian 

Topic 11: democrats, republican, republicans, election, democratic, house, vote, states, gop, political 

Topic 12: words, word, people, power, like, language, point, written, person, powerful 

Topic 13: iraq, war, american, bush, afghanistan, years, petraeus, troops, new, mission 

Topic 14: voters, obama, indiana, Carolina, north, clinton, polls, primary, democratic, Pennsylvania 

Topic 15: america, american, nation, country, americans, history, civil, years, king, national 

Topic 16: said, city, people, two, homes, area, water, river, state, officials 

Topic 17: media, news, story, coverage, television, new, public, journalism, broadcast, channel 

Topic 18: israel, peace, israeli, east, hamas, Palestinian, state, arab, middle, israels 

Topic 19: poll, chance, gallup, degrees, winning, results, tracking, general, election, august 

Topic 20: said, iraqi, government, forces, baghdad, city, shiite, security, sadr, minister 

Topic 21: senator, obama, obamas, people, clinton, Pennsylvania, comments, bitter, remarks, negative 

Topic 22: rights, law, court, justice, constitution, supreme, right, laws, courts, constitutional 

Topic 23: company, said, billion, yahoo, stock, share, inc, deal, microsoft, shares 

Topic 24: health, care, families, insurance, working, pay, help, americans, plan, people 

Topic 25: white, race, voters, obama, Virginia, west, percent, states, whites, win 

Topic 26: wright, obama, rev, jeremiah, pastor, obamas, reverend, political, said, black 

Topic 27: tax, government, economic, spending, taxes, cuts, economy, budget, federal, people 

Topic 28: study, cancer, found, drugs, age, risk, drug, heart, brain, medical 

Topic 29: people, man, black, america, didnt, god, hope, know, years, country 

Topic 30: global, climate, warming, change, energy, countries, new, carbon, environmental, emissions 



Fig 5. Huffington Post: The ten most probable words from the 30 most popular topics. At top are the positive and 
negative correlation coefficients for these topics calculated by taking the dot product of the topic locations, $\ell_k^\top \ell_{k'}$ 
(separated for clarity). 



50 topics out of the 200 initially supplied. All results are shown for a batch size of 750. 

Stochastic DILN vs batch DILN. We also compare stochastic and batch inference for 
DILN to show how stochastic inference can significantly speed up the inference process, while 
still giving results as good as batch inference. We again use the Nature corpus. For stochastic 
inference, we use a subset of size $|B_t| = 1000$ and a step of $(1 + t)^{-0.75}$. For batch inference, 

Topic 1: manager, science, fax, advertising, aaas, sales, recruitment, member, associate, Washington 

Topic 2: research, science, funding, scientists, university, universities, government, program, year 

Topic 3: fault, plate, earthquake, earthquakes, zone, crust, seismic, fig, crustal, large 

Topic 4: hiv, virus, infection, infected, viral, viruses, human, immunodeficiency, aids, disease 

Topic 5: species, forest, forests, conservation, ecosystems, fish, natural, land, tropical, ecological 

Topic 6: climate, changes, temperature, change, global, atmospheric, carbon, years, year, variability 

Topic 7: cells, immune, cell, antigen, response, responses, mice, lymphocytes, antibody, specific 

Topic 8: transcription, binding, dna, transcriptional, promoter, polymerase, factors, site, protein 

Topic 9: says, university, just, colleagues, team, like, researchers, meeting, new, end 

Topic 10: structure, residues, helix, binding, two, fig, helices, side, three, helical 

Topic 11: proteins, protein, membrane, ras, gtp, binding, bound, transport, guanosine, membranes 

Topic 12: pressure, temperature, high, phase, pressures, temperatures, experiments, gpa, melting 

Topic 13: rna, mrna, site, splicing, rnas, pre, intron, base, cleavage, nucleotides 

Topic 14: protein, cdna, fig, sequence, lane, purified, human, lanes, clone, gel 

Topic 15: kinase, protein, phosphorylation, kinases, activity, activated, signaling, camp, pathway 

Topic 16: university, students, says, faculty, graduate, women, science, professor, job, lab 

Topic 17: new, says, university, years, human, humans, ago, found, modern, first 

Topic 18: researchers, found, called, says, team, work, colleagues, new, university, protein 

Topic 19: isotopic, carbon, oxygen, isotope, water, values, ratios, organic, samples, composition 

Topic 20: disease, patients, diseases, gene, alzheimers, cause, mutations, syndrome, protein, genetic 

Topic 21: aids, vaccine, new, researchers, vaccines, trials, people, research, clinical, patients 

Topic 22: receptor, receptors, binding, ligand, transmembrane, surface, signal, hormone, extracellular 

Topic 23: cells, cell, bone, human, marrow, stem, types, line, lines, normal 

Topic 24: united, states, countries, international, world, development, japan, european, nations, europe 
Topic 25: proteins, protein, yeast, two, domain, sequence, conserved, function, amino, family 
Topic 26: letters, mail, web, end, new, org, usa, science, full, letter 

Topic 27: amino, acid, peptide, acids, peptides, residues, sequence, binding, sequences, residue 
Topic 28: species, evolution, evolutionary, phylogenetic, biology, organisms, history, different, evolved 
Topic 29: ocean, sea, pacific, water, atlantic, marine, deep, surface, north, waters 

Topic 30: gene, genes, development, genetic, mouse, function, expressed, expression, molecular, product 



Fig 6. Science: The ten most probable words from the 30 most popular topics. At top are the positive and negative 
correlation coefficients for these topics calculated by taking the dot product of the topic locations, $\ell_k^\top \ell_{k'}$ (separated 
for clarity). 



we use a randomly selected subset of documents, performing experiments on corpus size $M \in 
\{25000, 50000, 100000\}$. All algorithms used the same test set and testing procedure, as discussed 
in Section 5.1. All experiments were run on the same computer to allow for fair time comparisons. 

In Figure 12, we plot the held-out per-word log likelihood as a function of time. We measured 
performance every tenth iteration to construct each curve. The stochastic inference curve represents 
roughly six passes through the entire corpus. For batch inference, we see that performance 
improves significantly as the sub-sampled batch size increases. However, this improvement is 
paid for with an increasing runtime. Stochastic inference is much faster, but still performs as 
well as batch in predicting test documents. 

Czech Republic: Grad School Bridges Old Divisions 
0.97 : Central Europe: After Communism: Reinventing Higher Education 
0.97 : A Scientific Community on the Edge 
0.96 : Poland: Teachers Struggle With Low Funds and Morale 
0.96 : Will Profits Override Political Protests 
0.96 : A Second Chance to Make a Difference in the Third World? 

[query title not recoverable] 
0.97 : Depicting Epidemiology 
0.94 : EC Biotechnology Policy 
0.94 : Global Warming 
0.94 : Indirect Costs 
0.94 : Biology Textbooks 

1 : Human Gene Therapy Protocols: RAC Review 
0.89 : Funding of NIH Grant Applications: Update 
0.89 : Lyme Disease Research 
0.89 : AIDS Virus History 
0.88 : Guidelines for Xenotransplantation 
0.88 : Communication Sciences: A Thriving Discipline 

1 : A Cooler Way to Balance the Sea's Salt Budget 
0.94 : New Crater Age Undercuts Killer Comets 
0.94 : A Piece of the Dinosaur Killer Found? 
0.94 : Reading History from a Single Grain of Rock 
0.92 : Ancient Rocks, Rhythms in Mud, a Tipsy Venus 
0.91 : Deep-Sea Coral Records Quick Response to Climate 

1 : Is the Universe Fractal? 
0.94 : Extracting Primordial Density Fluctuations 
0.94 : Ages of the Oldest Clusters and the Age of the Universe 
0.93 : The Age and Size of the Universe 
0.92 : From Microwave Anisotropies to Cosmology 
0.92 : Multiscaling Properties of Large-Scale Structure in the Universe 

1 : Calculus Reform 
0.97 : Characterizing Scientific Knowledge 
0.96 : Doctoral Entitlement? 
0.95 : Peer-Review Study 
0.94 : Organoids and Genetic Drugs 
0.94 : Corrections and Clarifications: Getting to the Front of the Bus 

Transmuting Light Into X-rays 
0.86 : Atomic Mouse Probes the Lifetime of a Quantum Cat 
0.86 : An Everyman's Free-Electron Laser? 
0.85 : Knocking Genes In Instead of Out 
0.85 : Laser Pulses Make Fast Work of an Optical Switch 
0.85 : Putting the Infrared Heat on X-rays 

[query title not recoverable] 
0.96 : Tumor Cells Fight Back to Beat Immune System 
0.95 : Taming Rogue Immune Reactions 
0.95 : Cancer Vaccines Get a Shot in the Arm 
0.94 : Thyroid Disease: A Case of Cell Suicide? 
0.94 : Concerns Raised About Mouse Models for AIDS 

[query title not recoverable] 
0.98 : New Skeleton Gives Path From Trees to Ground an Odd Turn 
0.97 : New Hominid Crowds the Field 
0.97 : Amazonian Diversity: A River Doesn't Run Through It 
0.97 : A New Face for Human Ancestors 
0.96 : From Embryos and Fossils, New Clues to Vertebrate Evolution 

[query title not recoverable] 
0.93 : Key Protein Found for Brain's Dopamine-Producing Neurons 
0.91 : Technical Advances Power Neuroscience 
0.91 : Researchers Find Signals That Guide Young Brain Neurons 
0.91 : Knockouts Shed Light on Learning 
0.91 : Synapse-Making Molecules Revealed 

1 : Lighting a Route to the New Physics-With Photons 
0.97 : Conjuring Matter From Light 
0.96 : The Subtle Flirtation of Ultracold Atoms 
0.96 : Making Waves With Interfering Atoms 
0.96 : First Atom Laser Shoots Pulses of Coherent Matter 
0.95 : Interfering with Atoms to Clear a Path for Lasers 

1 : Emergent Properties of Networks of Biological Signaling Pathways 
0.90 : Complexity in Biological Signaling Systems 
0.88 : What Maintains Memories? 
0.87 : Molecular Code for Cooperativity in Hemoglobin 
0.87 : Biological Information Processing: Bits of Progress 
0.87 : The Path to Specificity 

1 : Small NASA Missions 
0.95 : Analogies with Meaning 
0.91 : NASA Funding for Earth Science 
0.90 : Asking for the Moon 
0.90 : Delaney Reform 
0.90 : New Observations 

1 : The Superswell and Mantle Dynamics Beneath the South Pacific 
0.97 : Phase Boundaries and Mantle Convection 
0.97 : Not So Hot Hot Spots in the Oceanic Mantle 
0.97 : Seismic Attenuation Structure of Fast-Spreading Mid-Ocean Ridge 
0.96 : Compositional Stratification in the Deep Mantle 
0.96 : Mantle Plumes and Continental Tectonics 

Fig 7. Several example document searches for Science. The first document is the query document, followed by the 
most similar documents according to the cosine similarity measure on their locations (given at left). 

6. Discussion 

We have presented the discrete infinite logistic normal distribution, a Bayesian nonparametric 
prior for mixed-membership models. DILN overcomes the hidden assumptions of the HDP 
and explicitly models correlation structure between the mixing weights at the group level. We 
showed that this is achieved by letting the second parameter of the gamma process representation 
of the hierarchical Dirichlet process vary per component according to an exponentiated 
Gaussian process. This Gaussian process is defined on latent component locations added to the 
hierarchical structure of the HDP. 

Fig 8. Stochastic variational inference results on Nature. The number of documents processed is shown in log 
scale. We observe improved performance for all algorithms as $\kappa$ decreases, and note that DILN is able to obtain 
a level of performance not reached by HDP and LDA as a function of parameter settings. 



Using batch variational Bayesian inference, we showed an improvement in predictive ability 
over the HDP and the CTM in a topic modeling application. Furthermore, we showed how this 
algorithm can be modified to obtain a new variational inference algorithm for HDPs based on the 
gamma process. We then extended the model to the stochastic inference setting, which allows 
for fast analysis of much larger corpora. 

DILN can be useful in other modeling frameworks. For example, hidden Markov models can 
be viewed as a collection of mixture models that are defined over a shared set of parameters, 
where state transitions follow a Markov transition rule. Teh et al. (2006) showed how the HDP 
can be applied to the HMM to allow for infinite state support, thus creating a nonparametric 
hidden Markov model, where the number of underlying states is inferred. DILN can be adapted 
to this problem as well, in this case modeling correlations between state transition probabilities. 

7. Appendix 

7.1. Proof of almost sure finiteness of $\sum_{i=1}^{\infty} Z_i e^{W_i}$ 

We drop the group index $m$ and define $W_i := W(\ell_i)$. The normalizing constant for DILN, prior 
to absorbing the scaling factor within the gamma distribution, is $S := \sum_{i=1}^{\infty} Z_i e^{W_i}$. We first 
show that this value is finite almost surely when the Gaussian process has bounded mean and 
covariance functions. This case would apply for example when using a Gaussian kernel. We then 
give a proof for the kernel in Section 3.4 when the value of $c < 1$. 

Let $S_T := \sum_{i=1}^{T} Z_i e^{W_i}$. It follows that $S_1 \le \cdots \le S_T \le \cdots \le S$ and $S = \lim_{T \to \infty} S_T$. To 
prove that $S$ is finite almost surely, we only need to prove that $\mathbb{E}[S]$ is finite. From the monotone 

Fig 9. Stochastic learning of Nature. The values of $\alpha$ and $\beta$ as a function of number of documents seen 
(horizontal axis in units of $10^5$) for batch size equal to 750 and learning rate $\kappa = 0.6$. 

Fig 10. Stochastic learning of Nature. (left) Correlations between 100 randomly selected pairs of topics as a 
function of documents seen (horizontal axis in units of $10^5$). (right) The empirical word count from the posteriors 
of the top 50 topics after the final iteration, plotted against sorted topic number. Approximately 50 of the 200 
topics are used. 

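convergence theorem, we have that $\mathbb{E}[S] = \lim_{T \to \infty} \mathbb{E}[S_T]$. Furthermore, $\mathbb{E}[S_T]$ can be upper 
bounded as follows, 

$$\mathbb{E}[S_T] = \sum_{i=1}^{T} \mathbb{E}[Z_i]\,\mathbb{E}[e^{W_i}] \le e^{\max_i (\mu_i + \frac{1}{2}\sigma_i^2)} \sum_{i=1}^{T} \mathbb{E}[Z_i]. \tag{42}$$

The inequality uses $\mathbb{E}[e^{W_i}] = e^{\mu_i + \sigma_i^2/2}$ for a Gaussian $W_i$ with mean $\mu_i$ and variance $\sigma_i^2$. 
$\mathbb{E}[S]$ is therefore upper bounded by $\beta e^{\max_i (\mu_i + \frac{1}{2}\sigma_i^2)}$ and $S$ is finite almost surely. 

For the kernel in Section 3.4, we prove that $\mathbb{E}[S] < \infty$ when $c < 1$. We only focus on this case 
since values of $c \ge 1$ are larger than we are interested in for our application. For example, given 
that $\ell \in \mathbb{R}^d$ and $\ell \sim \text{Normal}(0, cI_d)$, it follows that $\mathbb{E}[\ell^\top \ell] = dc$, which is the expected variance 
of the Gaussian process at this location. In our applications, we set $c = 1/d$, which is less than 
one when $d > 1$. As above, we have 

$$\mathbb{E}[S_T] = \sum_{i=1}^{T} \mathbb{E}[Z_i]\,\mathbb{E}[e^{W_i}] = \sum_{i=1}^{T} \beta p_i\, \mathbb{E}[e^{\frac{c}{2} u^\top u}]. \tag{43}$$

Since $u \sim \text{Normal}(0, I_d)$, this last expectation is finite when $c < 1$, and therefore the limit 
$\lim_{T \to \infty} \mathbb{E}[S_T]$ is also finite. 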
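
The finiteness condition can be checked numerically. For $u \sim \text{Normal}(0, I_d)$, the quantity $u^\top u$ is chi-square distributed with $d$ degrees of freedom, so $\mathbb{E}[e^{\frac{c}{2} u^\top u}] = (1 - c)^{-d/2}$ when $c < 1$ and diverges otherwise. The sketch below assumes this is the expectation appearing at the end of Equation (43); the function names, sample size, and seed are illustrative choices.

```python
import numpy as np


def closed_form(c, d):
    """E[exp((c/2) u^T u)] for u ~ Normal(0, I_d); finite only when c < 1."""
    if c >= 1.0:
        return float("inf")
    return (1.0 - c) ** (-d / 2.0)


def monte_carlo(c, d, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the same expectation."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_samples, d))
    return float(np.mean(np.exp(0.5 * c * np.sum(u * u, axis=1))))


d = 20
c = 1.0 / d  # the setting used in the experiments
exact = closed_form(c, d)
estimate = monte_carlo(c, d)
```

For $d = 20$ and $c = 1/20$ the closed form equals $0.95^{-10}$, and the simulated average should agree with it up to Monte Carlo error.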



Topic 1: author, facts, original, written, hand, text, think, himself, pages, mind 

Topic 2: war, england, carried, death, french, german, issued, great-britain, sent, works 

Topic 3: equation, flow, sample, average, mantle, rates, distribution, zone, ratios, calculated 

Topic 4: million, scientists, policy, britain, social, economic, technology, political, project, organization 

Topic 5: gene, genes, expression, mutant, wild-type, sequence, supplementary, embryos, mutants, clones 

Topic 6: glass, tube, colour, due, substance, rays, apparatus, substances, action-of, series 

Topic 7: serum, labelled, fraction, anti, purified, buffer, fractions, rabbit, extract, extracts 

Topic 8: feet, rocks, island, specimens, sea, coast, islands, river, land, geological 

Topic 9: membrane, enzyme, concentration, glucose, inhibition, calcium, release, phosphate 

Topic 10: population, evolution, selection, genetic, environment, evolutionary, food, birds, breeding 

Topic 11: college, secretary, council, Cambridge, department, engineering, assistant, mathematics 

Topic 12: frequency, wave, spectrum, electron, absorption, band, electrons, optical, signal, peak 

Topic 13: binding, proteins, residues, peptide, chain, amino-acid, domain, terminal, sequence 

Topic 14: dna, rna, sequence, sequences, mrna, poly, fragments, synthesis, fragment, phage 

Topic 15: molecules, compounds, oxygen, molecule, reactions, formation, ion, ions, oxidation, compound 

Topic 16: the-sun, solar, the-earth, motion, observatory, stars, comet, star, night, planet 

Topic 17: techniques, materials, applications, reader, design, basic, service, computer, fundamental 

Topic 18: crystal, structures, unit, orientation, ray, diffraction, patterns, lattice, layer, symmetry 

Topic 19: vol, museum, plates, india, journal, ltd, net, indian, series, Washington 

Topic 20: sea, ice, ocean, depth, deep, the-earth, climate, sediments, earth, global 

Topic 21: you, says, her, she, researchers, your, scientists, colleagues, get, biology 

Topic 22: mice, anti, mouse, tumour, antigen, antibody, cancer, tumours, antibodies, antigens 

Topic 23: disease, blood, bacteria, patients, drug, diseases, clinical, drugs, bacterial, host 

Topic 24: radio, ray, emission, flux, stars, disk, sources, star, galaxies, galaxy 

Topic 25: brain, receptor, receptors, responses, stimulation, response, stimulus, cortex, synaptic, stimuli 
Topic 26: rats, liver, tissue, blood, dose, injection, rat, plasma, injected, hormone 

Topic 27: royal, lecture, lectures, engineers, royal-society, hall, institution-of, society-at, annual, january 

Topic 28: virus, cultures, culture, medium, infected, infection, viral, viruses, agar, colonies 

Topic 29: heat, oil, coal, electric, electricity, electrical, lead, supply, steam, tons 

Topic 30: particles, particle, electron, proton, neutron, protons, mev, force, scattering, nuclei 

Topic 31: education, universities, training, schools, teaching, teachers, courses, colleges, grants, student 

Topic 32: nuclear, radiation, irradiation, radioactive, uranium, fusion, reactor, storage, damage 

Topic 33: iron, copper, steel, metals, milk, aluminium, alloys, silicon, ore, haem 

Topic 34: soil, nitrogen, leaves, land, agricultural, agriculture, nutrient, yield, growing, content 

Topic 35: chromosome, nuclei, hybrid, chromatin, mitotic, division, mitosis, chromosomal, somatic 

Topic 36: pulse, spin, magnetic-field, pulses, polarization, orbital, decay, dipole, pulsar, polarized 

Topic 37: atoms, quantum, atom, einstein, classical, photon, relativity, bohr, quantum-mechanics 

Topic 38: strain, stress, strains, deformation, shear, stresses, failure, viscosity, mechanical, stressed 

Topic 39: medical, health, medicine, tuberculosis, schools, education, teaching, infection, bacilli, based 

Topic 40: adult, females, males, mating, mature, progeny, adults, maturation, aggressive, matings 



Fig 11. Stochastic DILN after one pass through the Nature corpus. The upper left figure shows the projected topic 
locations with + marking the origin. The upper right figure shows topic correlations. We list the ten most probable 
words for the first 40 topics. 

Fig 12. A comparison of stochastic and batch inference for DILN using the Nature corpus. Results are shown 
as a function of time (log scale). Stochastic inference achieves a good posterior approximation significantly faster 
than batch inference, which pays for improved performance with an increasing runtime. 



7.2. Variational inference for normalized gamma measures 

In DILN, and normalized gamma models in general, the expectation of the log of the normalizing 
constant, $\mathbb{E}_Q[\ln \sum_k Z_k]$, is intractable. We present a method for approximate variational Bayesian 
inference for these models. A Taylor expansion on this term about a particular point allows for 
tractable expectations, while still preserving the lower bound on the log-evidence of the model. 
Since the log function is concave, the negative of this function can be lower bounded by a 
first-order Taylor expansion, 

$$-\mathbb{E}_Q\Big[\ln \sum_{k=1}^{T} Z_k\Big] \ge -\ln \xi - \frac{\sum_{k=1}^{T} \mathbb{E}_Q[Z_k] - \xi}{\xi}. \tag{44}$$



We have dropped the group index $m$ for clarity. A new term $\xi$ is introduced into the model as 
an auxiliary parameter. Changing this parameter changes the tightness of the lower bound, and 
in fact, it can be removed by permanently tightening it, 

$$\xi = \sum_{k=1}^{T} \mathbb{E}_Q[Z_k]. \tag{45}$$

In this case $\mathbb{E}_Q[\ln \sum_k Z_k]$ is replaced with $\ln \sum_k \mathbb{E}_Q[Z_k]$ in the variational objective function. We 
do not do this, however, since retaining $\xi$ in DILN allows for analytical parameter updates, while 
using Equation (45) requires gradient methods. These analytical updates result in an algorithm 
that is significantly faster. For example, inference for the corpora considered in this paper ran 
approximately five times faster. 
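
The role of the auxiliary parameter can be illustrated numerically. The sketch below evaluates the first-order upper bound on the log of the summed expectations, $\ln \xi + (\sum_k \mathbb{E}_Q[Z_k] - \xi)/\xi$, at two expansion points; the function name and the values standing in for $\mathbb{E}_Q[Z_k]$ are hypothetical.

```python
import math


def expected_log_sum_bound(expected_z, xi):
    """First-order Taylor upper bound on E_Q[ln sum_k Z_k], expanded about xi."""
    total = sum(expected_z)
    return math.log(xi) + (total - xi) / xi


expected_z = [0.5, 1.5, 3.0]  # hypothetical values of E_Q[Z_k]
total = sum(expected_z)
loose = expected_log_sum_bound(expected_z, xi=2.0)
tight = expected_log_sum_bound(expected_z, xi=total)
```

Setting the expansion point to the sum of the expectations gives the smallest value of the bound, which then equals the log of that sum; any other positive expansion point is looser.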

Because this property extends to variational inference for all mixture models using the normalized 
gamma construction, most notably the HDP, we derive these updates using a generic 
parameterization of the gamma distribution, Gamma$(a_k, b_k)$. The posterior of $Z_{1:T}$ in this model 
is proportional to 

$$p(Z_{1:T} \mid C_{1:N}, a_{1:T}, b_{1:T}) \propto \Bigg[\prod_{n=1}^{N} \prod_{k=1}^{T} \Big(\frac{Z_k}{\sum_{j=1}^{T} Z_j}\Big)^{\mathbb{I}(C_n = k)}\Bigg] \prod_{k=1}^{T} Z_k^{a_k - 1} e^{-b_k Z_k}. \tag{46}$$



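Under a factorized $Q$ distribution, the variational lower bound at nodes $Z_{1:T}$ is 

$$\begin{aligned} \mathbb{E}_Q[\ln p(Z_{1:T} \mid -)] + \mathbb{H}[Q] ={}& \sum_{n=1}^{N} \sum_{k=1}^{T} P_Q(C_n = k)\,\mathbb{E}_Q[\ln Z_k] - N\,\mathbb{E}_Q\Big[\ln \sum_{k=1}^{T} Z_k\Big] \\ &+ \sum_{k=1}^{T} (\mathbb{E}_Q[a_k] - 1)\,\mathbb{E}_Q[\ln Z_k] - \sum_{k=1}^{T} \mathbb{E}_Q[b_k]\,\mathbb{E}_Q[Z_k] \\ &+ \sum_{k=1}^{T} \mathbb{H}[Q(Z_k)] + \text{const}. \end{aligned} \tag{47}$$

The intractable term, $-N\,\mathbb{E}_Q[\ln \sum_k Z_k]$, is replaced with the bound in Equation (44). 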

Rather than calculate for a specific $q$ distribution on $Z_k$, we use the procedure discussed by 
Winn and Bishop (2005) for finding the optimal form and parameterization of a given $q$: We 
exponentiate the variational lower bound in Equation (47) with all expectations involving the 
parameter of interest not taken. For $Z_k$, this gives 

$$q(Z_k) \propto e^{\mathbb{E}_{Q_{-Z_k}}[\ln p(Z_k \mid C_{1:N}, a_{1:T}, b_{1:T})]} \propto Z_k^{\mathbb{E}_Q[a_k] + \sum_{n=1}^{N} P_Q(C_n = k) - 1}\, e^{-(\mathbb{E}_Q[b_k] + N/\xi) Z_k}.$$

Therefore, the optimal $q$ distribution for $Z_k$ is $q(Z_k) = \text{Gamma}(Z_k \mid a'_k, b'_k)$ with $a'_k = \mathbb{E}_Q[a_k] + 
\sum_{n=1}^{N} P_Q(C_n = k)$ and $b'_k = \mathbb{E}_Q[b_k] + N/\xi$. The specific values of $a'_k$ and $b'_k$ for DILN are given 
in Equation (23). 
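
The closed-form update can be sketched as follows. This is a minimal illustration of the generic Gamma$(a'_k, b'_k)$ update, not the DILN-specific update of Equation (23). The array `phi` (holding $P_Q(C_n = k)$), the stand-in values for $\mathbb{E}_Q[a_k]$ and $\mathbb{E}_Q[b_k]$, and the choice to alternate this update with the tightening step of Equation (45) are assumptions made for the example.

```python
import numpy as np


def update_gamma_q(phi, a_mean, b_mean, xi):
    """Closed-form update of q(Z_k) = Gamma(a'_k, b'_k).

    phi is an (N, T) array with phi[n, k] = P_Q(C_n = k).
    """
    n_obs = phi.shape[0]
    a_new = a_mean + phi.sum(axis=0)
    b_new = b_mean + n_obs / xi
    return a_new, b_new


def update_xi(a_new, b_new):
    """Tighten the bound: xi = sum_k E_Q[Z_k], with E_Q[Z_k] = a'_k / b'_k."""
    return float(np.sum(a_new / b_new))


# Hypothetical inputs: N = 4 observations, T = 3 components.
phi = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.0, 0.2, 0.8]])
a_mean = np.full(3, 0.5)  # stands in for E_Q[a_k]
b_mean = np.ones(3)       # stands in for E_Q[b_k]
xi = 1.0
for _ in range(50):       # alternate the two closed-form updates
    a_new, b_new = update_gamma_q(phi, a_mean, b_mean, xi)
    xi = update_xi(a_new, b_new)
```

Both steps are available in closed form, which is the property the text credits for the speed advantage over gradient-based updates. In this toy setting the alternation settles at a fixed point of $\xi = 1.5$.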